# Metric Learning

Metric learning is the new “fashionable” thing to do for many computer vision tasks. The main idea is relatively simple: modify the standard Euclidean distance with a learned matrix $\mathbf{M}$ that improves discriminative performance, so that the distance between features from the same class is small, while that between features from different classes is large. Yes, this is the Mahalanobis distance. In math,

$d^2_\mathbf{M}(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i - \mathbf{x}_j)^T\mathbf{M}(\mathbf{x}_i - \mathbf{x}_j)$
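In NumPy this is a one-liner. A minimal sketch (the function name and example vectors are mine):

```python
import numpy as np

def mahalanobis_sq(x_i, x_j, M):
    """Squared Mahalanobis distance: (x_i - x_j)^T M (x_i - x_j)."""
    diff = x_i - x_j
    return float(diff @ M @ diff)

# With M = I this reduces to the squared Euclidean distance.
x = np.array([1.0, 2.0])
y = np.array([3.0, 1.0])
print(mahalanobis_sq(x, y, np.eye(2)))  # (-2)^2 + 1^2 = 5.0
```

All of the methods below learn the matrix $\mathbf{M}$; they differ only in how.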

Some of the currently well-performing techniques are:

## LMNN (Large Margin Nearest Neighbour)

[Weinberger, NIPS2006]

The idea combines the slack-based margin of SVMs with the local region of a k-nearest-neighbour sphere. The formulation increases similarity to target neighbours (same-class points), while reducing it to impostors (points from other classes) lying within this k-nearest-neighbour region.
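As a toy illustration, the objective for a *fixed* metric might look like the following. The unit margin and the pull/push trade-off weight follow the paper; everything else (names, data) is mine, and this is just the loss evaluation, not the published semidefinite-programming solver:

```python
import numpy as np

def lmnn_loss(X, y, M, k=1, mu=0.5):
    """Toy LMNN objective for a fixed metric M. The pull term draws each
    point towards its k same-class target neighbours; the push term
    penalizes impostors (differently-labelled points) inside a unit margin."""
    def d2(i, j):
        diff = X[i] - X[j]
        return diff @ M @ diff

    n = len(X)
    pull, push = 0.0, 0.0
    for i in range(n):
        # Target neighbours: the k nearest same-class points
        # (plain Euclidean distance, fixed before training as in the paper).
        same = sorted((j for j in range(n) if j != i and y[j] == y[i]),
                      key=lambda j: np.sum((X[i] - X[j]) ** 2))[:k]
        for j in same:
            pull += d2(i, j)
            for l in range(n):
                if y[l] != y[i]:
                    # Hinge: impostor l must be one unit further than target j.
                    push += max(0.0, 1.0 + d2(i, j) - d2(i, l))
    return (1 - mu) * pull + mu * push

# Well-separated classes: no impostor enters the margin, only the pull term remains.
X = np.array([[0.0], [0.1], [10.0], [10.1]])
y = [0, 0, 1, 1]
print(lmnn_loss(X, y, np.eye(1)))  # 0.5 * (4 * 0.1^2) = 0.02
```

Minimizing this over $\mathbf{M}$ (subject to $\mathbf{M} \succeq 0$) is what the actual method does.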

## ITML (Information Theoretic Metric Learning)

[Davis, ICML2007]

The problem is formulated as the minimization of the differential relative entropy between two multivariate Gaussians, under constraints on the distance function. The trade-off lies in satisfying the constraints while staying close to a prior (usually the identity, which corresponds to the Euclidean distance). The constraints keep similar pairs below some distance $d_{sim}$ and dissimilar pairs above $d_{diff}$.
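The relative-entropy objective reduces to the LogDet (Burg) divergence between $\mathbf{M}$ and the prior $\mathbf{M}_0$, which is easy to sketch (the helper name is mine; the constrained Bregman-projection solver itself is omitted):

```python
import numpy as np

def logdet_divergence(M, M0):
    """LogDet divergence D_ld(M, M0) = tr(M M0^-1) - log det(M M0^-1) - d.
    ITML minimizes this against the prior M0 subject to the pair constraints.
    It is zero iff M == M0, so unconstrained pairs leave the prior untouched."""
    d = M.shape[0]
    P = M @ np.linalg.inv(M0)
    sign, logdet = np.linalg.slogdet(P)
    return float(np.trace(P) - logdet - d)

print(logdet_divergence(np.eye(3), np.eye(3)))      # 0.0: no deviation from the prior
print(logdet_divergence(2 * np.eye(3), np.eye(3)))  # positive: penalized deviation
```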

## LDML (Logistic Discriminant Metric Learning)

[Guillaumin, ICCV2009]

This method takes a probabilistic approach: the probability of a pair being similar is modelled by a sigmoid function of a bias term minus the Mahalanobis distance,

$p_{ij} = \sigma(b - d^2_\mathbf{M}(\mathbf{x}_i, \mathbf{x}_j))$

The metric is then estimated by gradient descent on the negative log-likelihood.
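A toy version of this fit (equivalently, gradient ascent on the log-likelihood) can be written in a few lines. The learning rate, iteration count, and names are mine; this is an illustrative sketch, not the paper's solver:

```python
import numpy as np

def ldml_fit(X, pairs, labels, lr=0.01, iters=500):
    """Toy LDML sketch: model p_ij = sigmoid(b - d^2_M(x_i, x_j)) and
    maximize the log-likelihood over M and the bias b by gradient ascent."""
    d = X.shape[1]
    M = np.eye(d)
    b = 0.0
    for _ in range(iters):
        gM = np.zeros((d, d))
        gb = 0.0
        for (i, j), t in zip(pairs, labels):
            diff = X[i] - X[j]
            p = 1.0 / (1.0 + np.exp(-(b - diff @ M @ diff)))
            # d(log-likelihood)/db = t - p;  d/dM = -(t - p) * diff diff^T
            gb += t - p
            gM -= (t - p) * np.outer(diff, diff)
        b += lr * gb
        M += lr * gM
    return M, b
```

After fitting, a pair with $\sigma(b - d^2_\mathbf{M}) > 0.5$ is predicted to be "same"; the bias $b$ acts as the decision threshold on the distance.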

## KISSME (Keep It Simple and Straightforward MEtric)

[Koestinger, CVPR2012]

This final approach is a non-iterative and extremely fast method to learn a metric. It assumes that the pairwise differences under both hypotheses (pair is similar, pair is dissimilar) are Gaussian, and formulates the metric as a difference of inverted covariance matrices,

$\mathbf{M} = \Sigma^{-1}_1 - \Sigma^{-1}_0$

In their experiments, the Gaussian assumption is made workable by reducing the dimension of the features via PCA.
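Because the estimate is closed-form, a sketch is very short. The function name and pair construction are mine; note that implementations often re-project the resulting $\mathbf{M}$ onto the cone of positive semidefinite matrices, which is omitted here:

```python
import numpy as np

def kissme(X, similar_pairs, dissimilar_pairs):
    """KISSME sketch: M = inv(Sigma_1) - inv(Sigma_0), where Sigma_1 and
    Sigma_0 are covariances of the pairwise differences x_i - x_j over the
    similar and dissimilar pairs. No iterations, just two matrix inverses."""
    def cov_of_diffs(pairs):
        D = np.array([X[i] - X[j] for i, j in pairs])
        return D.T @ D / len(D)
    return (np.linalg.inv(cov_of_diffs(similar_pairs))
            - np.linalg.inv(cov_of_diffs(dissimilar_pairs)))
```

On two well-separated clusters, the learned $\mathbf{M}$ yields small distances within a cluster and large ones across clusters, which is exactly the discriminative behaviour we asked for at the start.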

Go make your nearest neighbour search better.