and everyone who has used gradient descent knows that picking a “good” learning rate is almost an art.
In this post (content derived mainly from Lecture 6 of the Neural Networks for Machine Learning course on Coursera), we will quickly go over a few nice tricks for a better and faster experience with gradient descent. (Learning rate is abbreviated as LR from here on for simplicity.)
- Scale and shift the inputs so each component lies roughly in the range -1 to 1 (or even decorrelate them); see the first sketch after this list.
- Start with a small LR, test it out on a small subset of the data, and pick an appropriate value before training on the full set.
- For very large data sets, use mini-batches or (in rare cases) fully online learning; a mini-batch loop is sketched below.
- Adapt the global LR over time, using the sign of the gradient to drive the adaptation: if the sign stays the same, increase the LR, else decrease it.
- Determine individual weight-specific gains that multiply the global learning rate; one way to do this is to check for a change in the sign of the local gradient (this and the previous trick are sketched together below).
- Update a velocity instead of the position directly (called Momentum). Here the update is not just the current gradient, but also includes a scaled copy of the previous velocity. A better version applies the jump in the direction of the accumulated velocity first, and then makes a correction by computing the gradient at the look-ahead point (inspired by Nesterov's method); see the sketch below.
- rprop: Use only the sign of the gradient, and adapt the step size separately for each weight (it should not be used with mini-batches); sketched below.
- rmsprop: A mini-batch version of rprop, where the gradient is divided by the square root of a moving average of the squared gradient; sketched below.
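
To make the first trick concrete, here is a minimal NumPy sketch of standardizing inputs and, optionally, decorrelating them with PCA whitening. The toy data, the feature count, and the 1e-8 stabilizer are assumptions for illustration only.

```python
import numpy as np

X = np.random.default_rng(0).uniform(50, 100, size=(1000, 3))  # toy raw inputs

# Shift each feature to zero mean and scale to unit variance, so values
# land roughly in the -1..1 range that gradient descent likes.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# Optional decorrelation (whitening) via PCA: rotate onto the principal
# components and rescale by the square roots of the eigenvalues.
cov = np.cov(X_scaled, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
X_white = (X_scaled @ eigvecs) / np.sqrt(eigvals + 1e-8)
```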
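For mini-batches, a bare-bones SGD loop over reshuffled batches might look like the following. The toy linear-regression problem, batch size, and LR are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 5))                  # toy inputs
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=10000)    # toy targets

w = np.zeros(5)
lr, batch_size = 0.1, 64
for epoch in range(5):
    perm = rng.permutation(len(X))               # reshuffle each epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(idx)  # mini-batch gradient of MSE
        w -= lr * grad
```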
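The two sign-based adaptation tricks can be sketched together. This is only a heuristic illustration: the toy quadratic objective and the global grow/shrink factors are assumptions, while the +0.05 / x0.95 gain rule follows the lecture's suggestion.

```python
import numpy as np

def grad(w):
    # gradient of a toy quadratic f(w) = 0.5 * w @ A @ w; A is an assumption
    A = np.diag([1.0, 10.0])
    return A @ w

w = np.array([1.0, 1.0])
global_lr = 0.01
gains = np.ones_like(w)          # per-weight multipliers on the global LR
prev_g = np.zeros_like(w)

for step in range(100):
    g = grad(w)
    # global adaptation: grow the LR while successive gradients point the
    # same way (positive dot product), shrink it quickly when they disagree
    global_lr *= 1.05 if np.dot(g, prev_g) > 0 else 0.7
    # per-weight gains: increase additively when the local sign repeats,
    # decay multiplicatively when it flips (keeps gains hovering near 1)
    gains = np.where(np.sign(g) == np.sign(prev_g), gains + 0.05, gains * 0.95)
    w -= global_lr * gains * g
    prev_g = g
```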
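Momentum and its Nesterov-inspired variant differ only in where the gradient is evaluated. A minimal sketch, assuming a toy quadratic and typical constants (LR 0.01, momentum 0.9):

```python
import numpy as np

def grad(w):
    A = np.diag([1.0, 10.0])     # toy quadratic objective (an assumption)
    return A @ w

lr, mu = 0.01, 0.9

# Classical momentum: update a velocity, then move by it.
w, v = np.array([1.0, 1.0]), np.zeros(2)
for step in range(100):
    v = mu * v - lr * grad(w)
    w += v

# Nesterov variant: jump in the direction of the accumulated velocity
# first, then correct with the gradient measured at the look-ahead point.
w, v = np.array([1.0, 1.0]), np.zeros(2)
for step in range(100):
    v = mu * v - lr * grad(w + mu * v)
    w += v
```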
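A simplified rprop (the variant without weight-backtracking) on a full-batch toy problem. The grow/shrink factors 1.2 / 0.5 and the step-size bounds are common choices, not prescriptions:

```python
import numpy as np

def grad(w):
    A = np.diag([1.0, 10.0])     # toy full-batch objective (an assumption)
    return A @ w

w = np.array([1.0, 1.0])
step = np.full_like(w, 0.1)      # per-weight step sizes
prev_g = np.zeros_like(w)

for it in range(100):
    g = grad(w)
    same_sign = np.sign(g) * np.sign(prev_g)
    # grow the step while the sign repeats, shrink it when the sign flips
    step = np.where(same_sign > 0, np.minimum(step * 1.2, 1.0),
                    np.where(same_sign < 0, np.maximum(step * 0.5, 1e-6), step))
    w -= np.sign(g) * step       # only the sign of the gradient is used
    prev_g = g
```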
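And rmsprop, keeping a decaying average of the squared gradient for each weight. The decay of 0.9 follows the lecture; the toy noisy gradient and the epsilon are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def grad(w):
    # noisy "mini-batch" gradient of a toy quadratic (noise is an assumption)
    A = np.diag([1.0, 10.0])
    return A @ w + 0.1 * rng.normal(size=w.shape)

w = np.array([1.0, 1.0])
lr, decay, eps = 0.01, 0.9, 1e-8
ms = np.zeros_like(w)            # moving average of the squared gradient

for it in range(200):
    g = grad(w)
    ms = decay * ms + (1 - decay) * g**2
    w -= lr * g / (np.sqrt(ms) + eps)   # divide by sqrt of the moving average
```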
While all of the above are enhancements that improve learning, in both speed and performance, a recent paper from Yann LeCun's group ("No More Pesky Learning Rates") promises to do away with setting learning rates altogether.