In any machine learning problem, there is a tendency for the model to overfit the training data, becoming overly specific to its minute intricacies. This can be attributed to sampling error when the data is not a sufficiently representative sample of the world. Overfitting is usually undesirable, since an overfit model does not generalize well to other (typically unseen) data. Many techniques have been developed to prevent this sort of overfitting; here are a few of them.
While the first two – weight decay and early stopping – are commonly used across many different learning techniques, the latter three (to my knowledge) are more confined to neural networks and deep learning.
- Weight decay: This is among the most common techniques. The standard l2-norm penalty, which prevents weights from growing large, and the l1-norm penalty, which induces sparsity (i.e. implicitly drives many weights very close to 0), both fall into this category.
- Early stopping: In this method, a development set, or a left-out part of the training data, is used to monitor the performance of the model. As training proceeds, the error on the training data typically keeps decreasing. At some point, however, the model starts overfitting, which shows up as an increase in error on the left-out data. This is often a good point to stop training.
- Weight sharing: This method forces many weights of the model to be the same thus making the model simpler and reducing overfitting.
- Model averaging: Here, we train multiple versions of the same model with different initial conditions (or even multiple different models) and then average their predictions, in the hope that the ensemble performs better than any single model.
- Dropout (on Arxiv): A relatively new technique, developed in the context of neural nets. The idea is to randomly drop hidden units (along with their connections) during training, so the network does not rely too heavily on any specific set of units. This encourages each unit to work well both on its own and in combination with many different other units.
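To make the weight-decay idea concrete, here is a minimal sketch of l2 weight decay in gradient descent on a toy linear-regression problem (all data and hyperparameters here are illustrative, not from any real experiment):

```python
import numpy as np

# Toy regression data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w + rng.normal(scale=0.1, size=100)

lam = 0.1   # weight-decay strength (lambda); a hypothetical choice
lr = 0.05   # learning rate
w = np.zeros(5)

for _ in range(500):
    grad = X.T @ (X @ w - y) / len(y)  # gradient of mean squared error
    w -= lr * (grad + lam * w)         # the extra lam*w term shrinks weights every step
```

The only change from plain gradient descent is the `lam * w` term, which continually pulls every weight toward zero; replacing it with `lam * np.sign(w)` would give the l1 (sparsity-inducing) variant.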
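Early stopping can be sketched roughly as follows, using a left-out development split of hypothetical toy data; the patience threshold and splits are arbitrary choices for illustration:

```python
import numpy as np

# Toy data: few samples, many features, so the model can easily overfit
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 20))
w_true = rng.normal(size=20)
y = X @ w_true + rng.normal(scale=1.0, size=60)

X_tr, y_tr = X[:40], y[:40]     # training split
X_dev, y_dev = X[40:], y[40:]   # left-out development split

w = np.zeros(20)
lr = 0.01
best_dev, best_w, patience, bad = np.inf, w.copy(), 20, 0

for step in range(5000):
    grad = X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)
    w -= lr * grad
    dev_err = np.mean((X_dev @ w - y_dev) ** 2)
    if dev_err < best_dev:
        best_dev, best_w, bad = dev_err, w.copy(), 0
    else:
        bad += 1
        if bad >= patience:     # dev error stopped improving: stop training early
            break

w = best_w                      # keep the weights from the best dev-error point
```

The key detail is keeping a copy of the weights at the best development-set error, since by the time the stopping criterion fires the current weights have already moved past that point.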
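As a small illustration of weight sharing, a 1-D convolution reuses the same few weights at every input position instead of learning a full dense weight matrix (the kernel values here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=10)
kernel = np.array([0.25, 0.5, 0.25])  # 3 shared weights, reused at every position

# Slide the same 3-weight kernel along the input
out = np.array([x[i:i + 3] @ kernel for i in range(len(x) - 2)])
# A dense layer mapping 10 inputs to these 8 outputs would need 80 weights;
# weight sharing reduces that to just 3.
```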
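Model averaging might be sketched like this: several copies of the same linear model are trained from different random initial conditions (here also on bootstrap resamples, one common variant), and their predictions are averaged. All names and hyperparameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(80, 5))
w_true = rng.normal(size=5)
y = X @ w_true + rng.normal(scale=0.5, size=80)

def train(Xs, ys, seed, steps=300, lr=0.05):
    # Different seed -> different random initial weights
    w = np.random.default_rng(seed).normal(scale=0.1, size=Xs.shape[1])
    for _ in range(steps):
        w -= lr * Xs.T @ (Xs @ w - ys) / len(ys)
    return w

models = []
for seed in range(5):
    idx = rng.integers(0, len(y), size=len(y))  # bootstrap resample
    models.append(train(X[idx], y[idx], seed))

X_new = rng.normal(size=(10, 5))
avg_pred = np.mean([X_new @ w for w in models], axis=0)  # averaged prediction
```

By convexity of squared error, the averaged prediction can never be worse than the worst individual model, and in practice it is often better than the average one.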
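And a minimal sketch of dropout applied to a layer's activations (this uses the "inverted" scaling convention; the function name and arguments are made up for illustration):

```python
import numpy as np

def dropout(activations, p_drop=0.5, training=True, rng=None):
    """Randomly zero out units during training; pass activations through at test time."""
    if not training or p_drop == 0.0:
        return activations  # at test time, all units are used unchanged
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(activations.shape) >= p_drop  # keep each unit with prob 1 - p_drop
    # Scale kept units by 1/(1 - p_drop) so the expected activation is unchanged
    return activations * mask / (1.0 - p_drop)

h = np.ones((4, 8))
h_train = dropout(h, p_drop=0.5, rng=np.random.default_rng(0))  # some units zeroed
h_test = dropout(h, training=False)                             # identical to h
```

Because each forward pass sees a different random subset of units, no unit can rely on any specific set of other units being present, which is exactly the intuition described above.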
Drop in a comment and please help add the others!