Gradient Descent – Tips and Tricks

Gradient descent (steepest descent and gradient ascent are essentially the same, up to a sign change) is still one of the simplest and most popular optimization methods out there, and it works very well for minimizing convex functions. In essence, we update the parameters w of our model by the gradient of the error with respect to w, scaled by a learning rate \alpha:

\Delta w = -\alpha \nabla_w E

Everyone who has used gradient descent knows that picking a “good” \alpha is almost an art.
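As a minimal sketch of this update rule (the function names, the toy quadratic error, and the use of NumPy are my own illustration, not from the course):

```python
import numpy as np

def gradient_descent(grad_E, w0, alpha=0.1, n_steps=1000):
    """Plain gradient descent: repeatedly apply w <- w - alpha * grad_E(w)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(n_steps):
        w = w - alpha * grad_E(w)      # Delta w = -alpha * gradient of E w.r.t. w
    return w

# Toy example: E(w) = ||w - 3||^2, whose gradient is 2 * (w - 3).
w_opt = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=np.zeros(2))
print(w_opt)   # converges to roughly [3. 3.]
```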

In this post (content derived mainly from lecture 6 of the Neural Networks for Machine Learning course on Coursera), we will quickly go through a few tricks for a better and faster experience with gradient descent. (Learning rate is abbreviated as LR from here on.)

  • Scale and shift the inputs to a range of roughly -1 to 1 (or even decorrelate them).
  • Start with a small LR, test it on a small subset of the data, and pick an appropriate value from there.
  • For very large data sets, use mini-batches or (in rare cases) online learning.
  • Adapt the global LR over time, using the sign of the gradient: if the sign stays the same, increase the LR; if it flips, decrease it.
  • Determine individual weight-specific factors that modify the global LR; one way to do this is to check for a change in the sign of each weight’s local gradient.
  • Update a velocity instead of the position directly (Momentum). The update is not just the current gradient but also includes a (scaled) copy of the previous velocity: \Delta w(t) = v(t) = \beta v(t-1) - \alpha \nabla_w E(t). A better variant first jumps in the direction of the accumulated velocity and then makes a correction using the local gradient computed there (inspired by Nesterov’s method); see the sketch after this list.
  • rprop: use only the sign of the gradient, and adapt the step size separately for each weight (should not be used with mini-batches); sketched after this list.
  • rmsprop: a mini-batch version of rprop, where the gradient is divided by the square root of a moving average of the squared gradient; sketched after this list.
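A rough sketch of the momentum and Nesterov-style updates from the list above; the defaults \alpha = 0.01, \beta = 0.9 and the toy quadratic error are illustrative choices of mine, not prescriptions from the lecture:

```python
import numpy as np

def momentum_descent(grad_E, w0, alpha=0.01, beta=0.9, n_steps=1000):
    """Classical momentum: v(t) = beta * v(t-1) - alpha * grad_E(w); w <- w + v(t)."""
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(n_steps):
        v = beta * v - alpha * grad_E(w)
        w = w + v
    return w

def nesterov_descent(grad_E, w0, alpha=0.01, beta=0.9, n_steps=1000):
    """Nesterov-style momentum: first jump along the accumulated velocity,
    then correct with the gradient evaluated at the look-ahead point."""
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(n_steps):
        v = beta * v - alpha * grad_E(w + beta * v)
        w = w + v
    return w

# Toy example: E(w) = ||w - 3||^2, whose gradient is 2 * (w - 3).
print(momentum_descent(lambda w: 2.0 * (w - 3.0), w0=np.zeros(2)))
print(nesterov_descent(lambda w: 2.0 * (w - 3.0), w0=np.zeros(2)))
```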
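A simplified sketch of rprop; the increase/decrease factors 1.2 and 0.5 are the commonly quoted defaults, and the rest of the setup is my own toy example (real rprop implementations add a few more details, such as backtracking after a sign flip):

```python
import numpy as np

def rprop(grad_E, w0, step0=0.1, inc=1.2, dec=0.5,
          step_min=1e-6, step_max=1.0, n_steps=200):
    """Simplified rprop: use only the sign of each gradient component and keep
    a separate, multiplicatively adapted step size for every weight."""
    w = np.asarray(w0, dtype=float)
    step = np.full_like(w, step0)
    prev_g = np.zeros_like(w)
    for _ in range(n_steps):
        g = grad_E(w)
        agree = g * prev_g > 0                 # same sign as last time -> grow the step
        flip = g * prev_g < 0                  # sign flipped -> shrink the step
        step = np.where(agree, np.minimum(step * inc, step_max), step)
        step = np.where(flip, np.maximum(step * dec, step_min), step)
        w = w - np.sign(g) * step              # only the sign of the gradient is used
        prev_g = g
    return w

# Toy example: E(w) = ||w - 3||^2, whose gradient is 2 * (w - 3).
print(rprop(lambda w: 2.0 * (w - 3.0), w0=np.zeros(2)))
```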
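And a minimal sketch of rmsprop; the decay rate 0.9, the small epsilon, and the full-batch toy gradient are illustrative assumptions of mine (in practice the gradient would come from a mini-batch):

```python
import numpy as np

def rmsprop(grad_E, w0, alpha=0.01, decay=0.9, eps=1e-8, n_steps=2000):
    """rmsprop: divide the gradient by the square root of a moving average
    of its elementwise square before taking a step."""
    w = np.asarray(w0, dtype=float)
    ms = np.zeros_like(w)                      # moving average of squared gradients
    for _ in range(n_steps):
        g = grad_E(w)                          # on real problems this is a mini-batch gradient
        ms = decay * ms + (1.0 - decay) * g ** 2
        w = w - alpha * g / (np.sqrt(ms) + eps)
    return w

# Toy example: E(w) = ||w - 3||^2, whose gradient is 2 * (w - 3).
print(rmsprop(lambda w: 2.0 * (w - 3.0), w0=np.zeros(2)))  # approximately [3. 3.]
```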

While all of the above are enhancements that improve learning, both in speed and in final performance, a recent paper from Yann LeCun’s group promises to remove the need to set learning rates at all.
