Reducing Overfitting

In any machine learning problem, there is a tendency for the model to overfit the data, and make it very very specific to the minute intricacies of the training set. This can be attributed to sampling error if the data is not a sufficient representation of the world. Overfitting is usually unfavourable, since the model does not generalize well on other (typically unseen) data. There are many techniques developed to prevent this sort of overfitting, here are a few of them.

While the first two – weight decay, and early stopping – are commonly used in many different learning techniques, the latter (to my knowledge) are more confined to neural networks and deep learning.

  • Weight decay: These are among the most common techniques. The standard l2-norm which prevents weights from growing large, and l1-norm which induces sparsity (i.e. implicitly sets many weights very close to 0) can be included in this category.
  • Early stopping: In this method, a development set, or a left-out part of the training data can be used to analyze the performance of the model. As training proceeds, typically the error on the training data will reduce. However, at some point the model starts overfitting and is indicated by an increase in error on the left-out data. This might be a good point to stop training.
  • Weight sharing: This method forces many weights of the model to be the same thus making the model simpler and reducing overfitting.
  • Model averaging: Here, we train multiple versions of the same model, with different initial conditions (or even multiple models) and then average them in anticipation that they will perform well together.
  • Dropout (on Arxiv): A relatively new technique, developed in the context of neural nets. The idea is to drop connections between hidden units randomly, thus not relying too much on a set of specific connections. This promotes the units to work better alone, and in combinations of different other units.

Drop in a comment and please help add the others!


Gradient Descent – Tips and Tricks

Gradient descent (steepest descent, gradient ascent, are all basically the same with a sign change) is still among the most simple and most popular optimization method out there, and works very well for minimization of convex functions. In essence, we update the parameters (w) of our model by the gradient of the error with respect to w and scale it with a learning rate \alpha.

\Delta w = -\alpha \nabla_w E

and everyone who has used gradient descent knows that picking a “good” \alpha is almost an art.

In this post, (content derived mainly from the lecture 6 of Neural Networks for Machine Learning course @ CourseEra) we will quickly see a few nice tricks to have a better and faster experience with gradient descent. (Learning Rate is now abbreviated as LR for simplicity).

  • Scale and shift the inputs to ranges around -1 to 1 (or even decorrelate them).
  • Start with small LR, test out on small set of data and choose appropriate value.
  • For very large data sets use mini-batches or (in rare cases) online learning.
  • Adapt the global LR over time, use sign of the gradient for this adaptation. If the sign stays same, increase LR, else decrease LR.
  • Determine individual weight-specific factors to modify global learning rate, one way to do this is to check for a change in the sign of the local gradient.
  • Use a velocity update instead of position (called Momentum). Here the update is not only current gradient, but influenced by (scaled) value of previous velocity. \Delta w(t) = v(t) = \beta v(t-1) - \alpha \nabla_w E(t). A better version of the same is to apply the jump in the direction of accumulated velocity, and then make a correction by computing local gradient (inspired by Nesterov method).
  • rprop: Only use the sign of the gradient, and adapt the step size separately for each weight (should not be used on mini-batches).
  • rmsprop: mini-batch version of rprop, where the gradient is divided by a sqrt of the moving average of the squared gradient.

While all the above are enhancements to improve the learning, both in speed and performance, a recent paper from Yann LeCun’s group promises to not require setting learning rates anymore.

Metric Learning

Metric learning is the new “fashionable” thing to do for many computer vision tasks. The main idea is relatively simple — modify the standard euclidean distance by some matrix that helps improve the discrimative performance (distance of features from the same class is small, while that from different classes is large). Yes, this is the Mahalanobis distance. In math,

d^2_\mathbf{M}(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i - \mathbf{x}_j)^T\mathbf{M}(\mathbf{x}_i - \mathbf{x}_j)

Some of the current well-performing learning techniques are

LMNN (Large Margin Nearest Neighbour)
[Weinberger, NIPS2006]
The idea is a combination of allowance of slack for different classes (like SVMs) with the local region of a k-nearest neighbour sphere. The formulation tries to increase similarity to target neighbours, while reducing it to impostors (other classes) lying within this k-nearest neighbour region.

ITML (Information Theoretic Metric Learning)
[Davis, ICML2007]
The problem is formulated as the minimization of differential relative entropy between two multivariate Gaussians under constraints of the distance function. The trade off lies in satisfying constraints while being close to a prior (usually identity — leading to euclidean distance). The constraints are to keep similar pairs below some distance d_{sim} and and dissimilar pairs above d_{diff}.

LDML (Logistic Discriminant Metric Learning)
[Guillaumin, ICCV2009]
In this method, a probabilistic approach is used. The probability of a pair being similar is modeled by a sigmoid function which includes a bias term, and the Mahalanobis distance. The metric is then estimated by gradient descent. p_{ij} = \sigma(b - d^2_M(x_i, x_j))

KISSME (Keep It Simple and Straightforward MEtric)
[Koestinger, CVPR2012]
This final approach, is a non-iterative and extremely fast method to learn a metric. It assumes that the feature space of both hypothesis (pair is similar or dissimilar) is Gaussian, and then formulates the metric as a difference of (inverted) covariance matrices. In their experiments, the assumption is satisfied by reducing the dimension of the features via PCA. M = \Sigma^{-1}_1 - \Sigma^{-1}_0

Go make your nearest neighbour search better.

Supervised and Unsupervised Learning

In this first post, we will address the most basic forms of learning techniques. For all the upcoming posts, the goal of learning is to find an appropriate function f:\mathcal{X}\rightarrow \mathcal{Y} which maps some input x_i in space \mathcal{X} to y_i in the label space \mathcal{Y}. Here, input space is the data (images, audio signals, etc.) or more typically some interesting features (SIFT, MFCCs, etc.) and label space is the list of possible classes. In person identification this would be the list of people names, while in binary classification like face or not-a-face we just have 2 possible labels.

In Supervised learning the machine is provided with many examples x_i along with the class they belong to y_i. It is among the simplest, most studied forms of learning, and there are many popular classifiers which can be trained. However, it is also the most prohibitive to scale up, since each new concept needs to be taught separately with many examples. A lot of manual effort is required in obtaining the class labels y_i. Typical supervised learning techniques include Nearest Neighbour classifiers, Support Vector Machines (max-margin classifiers), Random Forests (decision trees), Linear Regression (and other regressors), etc. A common problem is over-fitting to the training data, and is reduced by having a validation set (some data kept aside for tuning parameters).

On the opposite side is Unsupervised learning where the data comes with no labels at all. The goal here is to automatically learn some underlying structure in the data. Most work is on grouping similar data (in feature space \mathcal{X}) into clusters. The most popular method is K-Means where the data is sorted into “K” clusters. The other common technique is Agglomerative clustering which has the advantage of not requiring to specify the number “K”. However, it does need some other stoppage criteria like distance between clusters which are to be merged, or the tree depth. (See: Clustering and Matlab)

The evaluation of clustering is quite tricky, and one interesting way to do it is by checking manual effort required to label the data (Read: Unsupervised Metric Learning for Face Identification in TV Video, Cinbis et al. ICCV2011). Other methods in some way capture the entropy of clusters. Adjusted Rand index also seems to be a good way to measure clustering accuracy (but Rand index alone is bad!).

Log-Sum-Exp trick

It is quite common to process small numbers in probability (no really, so small that double precision 10^{-308} may not be enough). They are encountered while modeling with Gaussians, which are so ubiquitous today. However, it is only so much that one Gaussian can do, and hence the obvious extension is to use a mixture of Gaussians, weighted differently with different means and covariances.

The probability function for a mixture of multi-variate Gaussians looks like
p(x) = \sum_i w_i \frac{1}{\sqrt{(2\pi)^N |C_i|}} \exp[-\frac{1}{2} (x-\mu_i)^T C_i^{-1} (x-\mu_i)]
where both x and \mu_i are vectors of dimension N, and i a count of the number of Gaussians in the mixture.

Everyone knows the above equation, but the problem comes when actually trying to compute it. So typically papers report this Log-Likelihood, which is basically the \log(p(x)). Consider an example of 2 gaussians. Then, you may need to evaluate something like f(x) = \log(e^{-1000} + e^{-1001}). Those are not made up. It is quite easy to obtain a score of 1000. Direct evaluation will usually return -Inf, but we clearly know that it cannot be true, and has to be around -999.something.

The simple thing to do is rewrite f(x) as
\log(e^{-1000}(e^0 + e^{-1})) = -1000 + \log(1 + e^{-1}) \approx -999.7.

And that is the computational trick. Pull out the maximum, from the sum, leave the rest (which can be evaluated). Just FYI, the maximum value is exp(709) which lies just below the numerical Infinity.

Machine Learning

Many of the common computer or electronic technologies today — web search, speech recognition, language translation, gestures, recommender systems, etc. all can come under this vast area called Machine Learning. The grand goal is to develop artificial intelligence to emulate the human brain and many of its functions.

Traditionally, machine learning has been “supervised”. The machine is shown a picture of an apple, and told that it’s an apple, or is told a certain language is english by showing many example words, or that a certain audio signal is music and some other speech. However, since we want the machine to learn increasing number of concepts in various areas, providing examples of each is too much spoon feeding! More over, machines inherently being dull require a lot of examples, say 50 to 100 to really understand what an apple is, and yet apples are still quite symmetric. Deformable objects like people or animals are even harder.

This has led to so many types of learning, and in the coming months I expect to write a short note on each. The technical depth will obviously increase, and most examples will pertain to images and vision. Below is an expected list:

Supervised learning; Unsupervised learning; Semi-supervised learning; Weakly supervised learning; Multiple label learning; Multiple instance learning; Multi-instance multi-label learning; Ambiguously labeled data learning; Probably Approximately Correct (PAC) learning; Transfer learning; Zero-shot learning; One-shot learning; Curriculum learning; Active learning; Proactive learning; Metric learning; Structured learning; Reinforcement learning; Consistency learning.

I am pretty sure to have missed many more learning techniques out there, let me know!

Intuition behind Average Precision and MAP

Average Precision (AP), more commonly, further averaged over all queries and reported as a single score — Mean Average Precision (MAP) — is a very popular performance measure in information retrieval. However, it is quite tricky to interpret and compare the scores that are seen. From my observations, most hard challenges (TRECVID Semantic Indexing / Genre Classification) have very low (0.05 to 0.3) MAP scores (max 1).

This post is my view on what the score conveys. I will also try to interpret what is a “good” MAP score, and what is “bad”. While this depends on the application, it is still good to have an idea of what to expect.

So firstly, what is MAP, or AP. Suppose we are searching for images of a flower and we provide our image retrieval system a sample picture of a rose (query), we do get back a bunch of ranked images (from most likely to least likely). Usually not all of them are correct. So we compute the precision at every correctly returned image, and then take an average. If our returned result is
1, 0, 0, 1, 1, 1
where 1 is an image of a flower, while 0 not, then the precision at every correct point is: how many correct images have been encountered up to this point (including current) divided by the total images seen up to this point.
1/1, 0, 0, 2/4, 3/5, 4/6

The AP for above example is 0.6917 (see comments). 0.461, and although of the 6 images returned 4 are correct, the AP is less than half! This is precisely what makes the score somewhat hard to understand.

A simple way to interpret is to produce a combination of zeros and ones which will give the required AP.
For example, an AP of 0.5 could have results like
0, 1, 0, 1, 0, 1, ...
where every second image is correct, while an AP of 0.333 has
0, 0, 1, 0, 0, 1, 0, 0, 1, ...
where every third image is correct.

Hopefully, it is clear by now that for an AP of 0.1, every 10th image will be correct, and that is definitely a bad retrieval system. On the other hand, for an AP above 0.5, we will encounter more correct images than wrong in the top results which is definitely a good sign.

MAP is just an extension, where the mean is taken across all AP scores for many queries, and again the above interpretation should hold true. Does your system have a MAP > 0.5? 🙂

Further, I would consider an improvement in MAP score from 0.2 (every 5th) to 0.4 (every 2nd/3rd) as much more significant than an improvement from 0.7 to 0.9 when the results are already pretty good.

While, the possibility to estimate a MAP score given the (multi-class) classification accuracy still remains a mystery to me, any other interpretations are more than welcome!

UPDATE 1: Another measure, the Mean Reciprocal Rank is equivalent to MAP when the number of correct elements in the returned list is just 1. In that case, we simply use the precision at that point, or the reciprocal index of the correct location and average across all queries. Thanks to zhangmeng1010 for the tip.