While the first two, weight decay and early stopping, are commonly used across many different learning methods, the remaining three are (to my knowledge) largely confined to neural networks and deep learning.
- Weight decay: Among the most common techniques. This category includes the standard l2-norm penalty, which prevents weights from growing large, and the l1-norm penalty, which induces sparsity (i.e. implicitly drives many weights very close to 0).
- Early stopping: Here, a development set (a held-out part of the training data) is used to monitor the model's performance. As training proceeds, the error on the training data typically keeps decreasing; at some point, however, the model starts overfitting, which shows up as an increase in error on the held-out data. That is usually a good point to stop training.
- Weight sharing: This method forces many weights of the model to be equal, making the model simpler and reducing overfitting. Convolutional networks, where the same filter weights are applied at every position of the input, are a well-known example.
- Model averaging: Here, we train multiple versions of the same model with different initial conditions (or even several different models) and average their predictions, in the hope that the ensemble performs better than any single model.
- Dropout (on Arxiv): A relatively new technique, developed in the context of neural nets. The idea is to randomly drop hidden units (and their connections) during training, so the network cannot rely too heavily on any specific set of connections. This encourages each unit to work well both on its own and in combination with many different sets of other units.
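To make weight decay concrete, here is a minimal sketch of one gradient-descent step on squared error with an l2 penalty added (the function name `ridge_gradient_step` and the hyperparameter values are illustrative, not from any particular library):

```python
import numpy as np

def ridge_gradient_step(w, X, y, lr=0.1, lam=0.01):
    """One gradient step on mean squared error plus an l2 (weight decay)
    penalty. Illustrative sketch, not a library implementation."""
    n = len(y)
    grad = X.T @ (X @ w - y) / n  # gradient of the data-fit term
    grad += lam * w               # l2 penalty pulls weights toward 0
    return w - lr * grad

# With no data signal (all zeros), the penalty alone shrinks the weights:
w = np.array([1.0, -2.0])
X = np.zeros((1, 2)); y = np.zeros(1)
w_new = ridge_gradient_step(w, X, y)  # each weight shrinks by a factor 0.999
```

The l1 variant would instead add `lam * np.sign(w)` to the gradient, which pushes small weights all the way to 0.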
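Early stopping reduces to a simple rule over the sequence of held-out errors; `early_stopping_index` and `patience` below are illustrative names for a sketch of that rule:

```python
def early_stopping_index(val_errors, patience=2):
    """Return the step with the best held-out error, stopping once the error
    has failed to improve for `patience` consecutive evaluations (a sketch)."""
    best, best_step, waited = float("inf"), 0, 0
    for step, err in enumerate(val_errors):
        if err < best:
            best, best_step, waited = err, step, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_step

# Held-out error falls, then rises: stop at its minimum (step 2).
early_stopping_index([0.9, 0.5, 0.3, 0.4, 0.6])  # → 2
```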
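One minimal way to see weight sharing is a tied-weight autoencoder, where the decoder reuses the encoder's weight matrix transposed, so only one matrix has to be learned (a sketch; the function name is made up for illustration):

```python
import numpy as np

def tied_autoencoder_forward(x, W, b_enc, b_dec):
    """Forward pass of an autoencoder whose decoder shares (ties) the
    encoder's weights, halving the number of weight matrices to learn."""
    h = np.tanh(W @ x + b_enc)  # encode
    x_hat = W.T @ h + b_dec     # decode with the SAME matrix, transposed
    return x_hat

rng = np.random.default_rng(0)
W, x = rng.normal(size=(3, 4)), rng.normal(size=4)
x_hat = tied_autoencoder_forward(x, W, np.zeros(3), np.zeros(4))
```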
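At prediction time, model averaging is just the mean of the individual models' outputs; a toy sketch (names and the toy "models" are illustrative):

```python
import numpy as np

def ensemble_predict(models, X):
    """Average the predictions of independently trained models.
    `models` is any list of callables mapping inputs to predictions."""
    return np.mean([m(X) for m in models], axis=0)

# Toy ensemble: three "models" that over- and under-shoot the true value 2.0;
# their individual errors cancel in the average.
models = [lambda X: X + 0.3, lambda X: X - 0.3, lambda X: X]
ensemble_predict(models, np.array([2.0]))
```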
Drop a comment and please help add the others!