# Generate Positive and Negative Pairs given Identities

This post is about generating positive pairs of features (in-class data points) and negative pairs (out-of-class data points) for tasks such as metric learning, or simply for learning classifiers. We need just 7 lines of Matlab code to generate these pairs, relying on the concepts of lower / upper triangular matrices and block diagonal matrices.

So here’s the code! Given a list of ground truth identities `ids`, we first generate the unique identities `uids`, just to know the number of classes. For each class, we count the number of data points that belong to it, `sum(uids(k) == ids)`, and generate a matrix of all ones.

To find negative pairs, we create a block diagonal matrix from the all-ones matrices of each class, and then simply negate it logically to get a 1 at the positions of out-of-class pairs and a 0 at in-class pairs. Taking the lower triangular part after this simply removes the duplicate pairs: `tril(~blkdiag(onesmats{:}))`.

For the positive pairs, we need a 1 for the intra-class pairs, and to remove duplicates we again use the upper triangular matrix and negate it, thus keeping a lower triangular matrix without the diagonal 😉 `~triu(ones(X))`. All positive pairs can then be found by concatenating these matrices as block diagonals.
```matlab
uids = unique(ids);
for k = 1:length(uids)
    onesmats{k} = ones(sum(uids(k) == ids));   % all-ones block for class k
    pospairmat{k} = ~triu(onesmats{k});        % strictly lower triangular part
end
[pospairs(:, 1), pospairs(:, 2)] = find(blkdiag(pospairmat{:}));
[negpairs(:, 1), negpairs(:, 2)] = find(tril(~blkdiag(onesmats{:})));
```

Finally, a simple `[x, y] = find()` gives the row and column indices of the pairs. Suggestions for shorter, faster and smarter techniques are more than welcome in the comments.

PS: The code assumes that the `ids` are sorted, i.e. features of the same class occur next to each other. If this is not the case for your data, it is easy to fix by running a sort and re-indexing the feature vectors.
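For readers who prefer Python, here is an equivalent sketch using NumPy broadcasting. The function name `make_pairs` is my own; note that with broadcasting, the sorting assumption above is no longer needed.

```python
import numpy as np

def make_pairs(ids):
    """Generate positive (same-class) and negative (different-class) index
    pairs from a list of identity labels, mirroring the Matlab code above."""
    ids = np.asarray(ids)
    n = len(ids)
    # Pairwise same-class mask: True where ids[i] == ids[j].
    same = ids[:, None] == ids[None, :]
    # Keep only the strictly lower triangle to drop duplicates and self-pairs.
    lower = np.tril(np.ones((n, n), dtype=bool), k=-1)
    pos = np.argwhere(same & lower)   # rows of (i, j) index pairs
    neg = np.argwhere(~same & lower)
    return pos, neg

pos, neg = make_pairs([0, 0, 0, 1, 1, 2])
```

The broadcasted comparison plays the role of the block diagonal matrix, so unsorted labels work as well.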

# Labeled data, Unlabeled data and Constraints

The standard paradigm of machine learning is to learn a model from some labeled “training” data $(X_l, y_l)$ and then apply this model to some “test” data for which we wish to obtain labels. However, for most tasks where machine learning is applied, there is an abundance of unlabeled data, while labeled data usually comes at a price (processing / manual annotation / sometimes even real money).

Recently there has been a trend towards using unlabeled data $X_u$ in the learning step, and such methods are termed semi-supervised learning. The idea is to exploit the fact that data points (even unlabeled ones) exist in that space, and that it is not just empty space.

Figure: Semi-supervised learning decision boundary (from Wikipedia)

The figure above captures this point very well. If we just had two labeled points (filled and empty circles), the decision boundary would be a simple straight line maximally separating the two (top half of the figure). However, the very knowledge of having more points in that banana-shaped space allows us to form much more intricate boundaries that better classify the points. The most popular notion in semi-supervised learning is to place the decision boundary respecting the labeled points, AND in a region with the least density of points.

In our framework, we go a step further and extend this idea to include constraints between data points. These constraints typically say whether a pair of data points belongs to the same class (positive constraint) or to different classes (negative constraint). This allows us to further refine the boundaries. For example, in the figure below we have a 3-class problem where only the colored data points ($\bigcirc, +, \nabla$) are labeled. The others ($\times$) are all unlabeled. Compare how the decision boundaries change to satisfy the constraints and the unlabeled data.

Figure: Labeled data, unlabeled data and constraints

We formulate the problem using differentiable loss functions (making optimization easier and faster!) for each of the three components: labeled data, unlabeled data and constraints, and then apply it to improve the performance of face recognition in TV series. Labeled data is obtained from alignments between subtitles (what is spoken when) and transcripts (who speaks what) to obtain who speaks when. Unlabeled data is plentiful: basically, all the face tracks which are not assigned a name. Currently we only use negative constraints between tracks occurring at the same time, since two faces appearing at the same time should (mostly) belong to different people.
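To make the three-part objective concrete, here is a toy Python sketch of such a loss for a simple linear softmax model. This is an illustration only, not the paper’s formulation: the function names and the particular forms of the unlabeled term (prediction entropy, a standard low-density proxy) and the negative-constraint term (penalizing two tracks agreeing on the same class) are my own choices.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def ssl_loss(W, X_l, y_l, X_u, neg_pairs, lam_u=0.1, lam_c=0.1):
    """Illustrative three-part objective: labeled cross-entropy,
    unlabeled prediction entropy, and negative-pair constraints.
    All terms are differentiable in W, so gradient methods apply."""
    # 1) Labeled data: standard cross-entropy.
    p_l = softmax(X_l @ W)
    loss_lab = -np.mean(np.log(p_l[np.arange(len(y_l)), y_l] + 1e-12))
    # 2) Unlabeled data: low entropy pushes the boundary away from points.
    p_u = softmax(X_u @ W)
    loss_unl = -np.mean(np.sum(p_u * np.log(p_u + 1e-12), axis=1))
    # 3) Negative constraints: penalize two tracks predicting the same class.
    i, j = neg_pairs[:, 0], neg_pairs[:, 1]
    loss_con = np.mean(np.sum(p_u[i] * p_u[j], axis=1))
    return loss_lab + lam_u * loss_unl + lam_c * loss_con
```

The trade-off weights `lam_u` and `lam_c` balance how much the unlabeled data and the constraints influence the boundary.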

For more details, results, and analysis please refer to
M. Bäuml, M. Tapaswi, and R. Stiefelhagen. Semi-supervised Learning with Constraints for Person Identification in Multimedia Data. CVPR 2013.

# Reducing Overfitting

In any machine learning problem, there is a tendency for the model to overfit the data, making it very specific to the minute intricacies of the training set. This can be attributed to sampling error when the data is not a sufficient representation of the world. Overfitting is usually unfavourable, since the model does not generalize well to other (typically unseen) data. Many techniques have been developed to prevent this sort of overfitting; here are a few of them.

While the first two – weight decay and early stopping – are commonly used in many different learning techniques, the latter three (to my knowledge) are more confined to neural networks and deep learning.

• Weight decay: These are among the most common techniques. The standard $\ell_2$-norm, which prevents weights from growing large, and the $\ell_1$-norm, which induces sparsity (i.e. implicitly sets many weights very close to 0), fall into this category.
• Early stopping: In this method, a development set, or a left-out part of the training data, is used to monitor the performance of the model. As training proceeds, the error on the training data typically decreases. However, at some point the model starts overfitting, which shows up as an increase in error on the left-out data. That is a good point to stop training.
• Weight sharing: This method forces many weights of the model to be the same thus making the model simpler and reducing overfitting.
• Model averaging: Here, we train multiple versions of the same model, with different initial conditions (or even multiple models) and then average them in anticipation that they will perform well together.
• Dropout (on arXiv): A relatively new technique, developed in the context of neural nets. The idea is to randomly drop hidden units (and their connections) during training, so the network does not rely too much on any specific set of units. This encourages units to work well both alone and in combination with different sets of other units.
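Of the list above, dropout is the easiest to show in a few lines. Below is a Python sketch of the common “inverted dropout” variant; the function name and the choice of this variant are mine, not from the original paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(h, p_drop=0.5, train=True):
    """Inverted dropout on a layer's activations h: each unit is zeroed
    with probability p_drop during training. Scaling the survivors by
    1/(1 - p_drop) keeps the expected activation unchanged, so the
    network needs no change at test time."""
    if not train or p_drop == 0.0:
        return h
    mask = (rng.random(h.shape) >= p_drop) / (1.0 - p_drop)
    return h * mask
```

At test time (`train=False`) the activations pass through untouched, which is exactly why the inverted variant is popular.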

# Gradient Descent – Tips and Tricks

Gradient descent (steepest descent and gradient ascent are basically the same up to a sign change) is still among the simplest and most popular optimization methods out there, and works very well for the minimization of convex functions. In essence, we update the parameters $w$ of our model using the gradient of the error $E$ with respect to $w$, scaled by a learning rate $\alpha$.

$\Delta w = -\alpha \nabla_w E$

and everyone who has used gradient descent knows that picking a “good” $\alpha$ is almost an art.

In this post (content derived mainly from Lecture 6 of the Neural Networks for Machine Learning course on Coursera), we will quickly go over a few nice tricks for a better and faster experience with gradient descent. (Learning rate is abbreviated as LR for simplicity.)

• Scale and shift the inputs to ranges around -1 to 1 (or even decorrelate them).
• Start with a small LR, test on a small subset of the data, and choose an appropriate value.
• For very large data sets use mini-batches or (in rare cases) online learning.
• Adapt the global LR over time, using the sign of the gradient for this adaptation: if the sign stays the same, increase the LR, else decrease it.
• Determine individual weight-specific factors to modify the global learning rate; one way to do this is to check for a change in the sign of the local gradient.
• Use a velocity update instead of a position update (called momentum). Here the update is not just the current gradient, but is also influenced by a (scaled) value of the previous velocity: $\Delta w(t) = v(t) = \beta v(t-1) - \alpha \nabla_w E(t)$. A better version of the same idea is to first jump in the direction of the accumulated velocity, and then make a correction by computing the local gradient there (inspired by the Nesterov method).
• rprop: Only use the sign of the gradient, and adapt the step size separately for each weight (should not be used on mini-batches).
• rmsprop: mini-batch version of rprop, where the gradient is divided by a sqrt of the moving average of the squared gradient.
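The momentum and rmsprop updates above can be sketched in a few lines of Python. This is a toy full-batch illustration; the function names and hyperparameter defaults are my own choices.

```python
import numpy as np

def gd_momentum(grad, w0, alpha=0.1, beta=0.9, steps=200):
    """Gradient descent with momentum, the velocity update from the post:
    delta_w(t) = v(t) = beta * v(t-1) - alpha * grad_E(t)."""
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(steps):
        v = beta * v - alpha * grad(w)
        w = w + v
    return w

def gd_rmsprop(grad, w0, alpha=0.01, decay=0.9, eps=1e-8, steps=1000):
    """rmsprop: divide the gradient by the square root of a moving
    average of the squared gradient, giving a per-weight step size."""
    w = np.asarray(w0, dtype=float)
    ms = np.zeros_like(w)
    for _ in range(steps):
        g = grad(w)
        ms = decay * ms + (1 - decay) * g ** 2
        w = w - alpha * g / (np.sqrt(ms) + eps)
    return w
```

On a simple quadratic like $E(w) = (w - 3)^2$, both routines settle near the minimum at $w = 3$, with momentum smoothing out the oscillations of plain descent.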

While all the above are enhancements that improve learning, both in speed and performance, a recent paper from Yann LeCun’s group promises to do away with setting learning rates altogether.

# Metric Learning

Metric learning is the new “fashionable” thing to do for many computer vision tasks. The main idea is relatively simple: modify the standard Euclidean distance by some matrix that helps improve the discriminative performance (the distance between features from the same class is small, while that between features from different classes is large). Yes, this is the Mahalanobis distance. In math,

$d^2_\mathbf{M}(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i - \mathbf{x}_j)^T\mathbf{M}(\mathbf{x}_i - \mathbf{x}_j)$

Some of the current well-performing learning techniques are

LMNN (Large Margin Nearest Neighbour)
[Weinberger, NIPS2006]
The idea combines the allowance of slack for different classes (as in SVMs) with the local region of a k-nearest-neighbour sphere. The formulation tries to increase similarity to target neighbours, while reducing it to impostors (points from other classes) lying within this k-nearest-neighbour region.

ITML (Information Theoretic Metric Learning)
[Davis, ICML2007]
The problem is formulated as the minimization of the differential relative entropy between two multivariate Gaussians under constraints on the distance function. The trade-off lies in satisfying the constraints while staying close to a prior (usually the identity matrix, leading to the Euclidean distance). The constraints keep similar pairs below some distance $d_{sim}$ and dissimilar pairs above $d_{diff}$.

LDML (Logistic Discriminant Metric Learning)
[Guillaumin, ICCV2009]
In this method, a probabilistic approach is used: the probability of a pair being similar is modeled by a sigmoid function which includes a bias term and the Mahalanobis distance, $p_{ij} = \sigma(b - d^2_M(x_i, x_j))$. The metric is then estimated by gradient descent.

KISSME (Keep It Simple and Straightforward MEtric)
[Koestinger, CVPR2012]
This final approach is a non-iterative and extremely fast method to learn a metric. It assumes that the pairwise feature differences under both hypotheses (pair is similar or dissimilar) are Gaussian, and then formulates the metric as a difference of inverted covariance matrices, $M = \Sigma^{-1}_1 - \Sigma^{-1}_0$. In their experiments, the assumption is satisfied by reducing the dimension of the features via PCA.
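Since KISSME boils down to two covariance estimates and an inversion, it is easy to sketch in Python. This is a simplified illustration: the function names, the small regularizer `eps`, and the omission of the PCA step and of any projection of $M$ back to positive semi-definite matrices are my own shortcuts.

```python
import numpy as np

def kissme_metric(X, pos_pairs, neg_pairs, eps=1e-6):
    """KISSME sketch: estimate second-moment matrices of pairwise feature
    differences for similar and dissimilar pairs, then take
    M = inv(Sigma_1) - inv(Sigma_0)."""
    d_pos = X[pos_pairs[:, 0]] - X[pos_pairs[:, 1]]
    d_neg = X[neg_pairs[:, 0]] - X[neg_pairs[:, 1]]
    sigma1 = d_pos.T @ d_pos / len(d_pos) + eps * np.eye(X.shape[1])
    sigma0 = d_neg.T @ d_neg / len(d_neg) + eps * np.eye(X.shape[1])
    return np.linalg.inv(sigma1) - np.linalg.inv(sigma0)

def mahalanobis_sq(M, xi, xj):
    """Squared Mahalanobis distance d^2_M(x_i, x_j)."""
    d = xi - xj
    return d @ M @ d
```

On well-separated classes, distances under the learned $M$ come out much smaller for similar pairs than for dissimilar ones, which is exactly what a nearest-neighbour search wants.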

Go make your nearest neighbour search better.

# Supervised and Unsupervised Learning

In this first post, we will address the most basic forms of learning techniques. For all the upcoming posts, the goal of learning is to find an appropriate function $f:\mathcal{X}\rightarrow \mathcal{Y}$ which maps an input $x_i$ in space $\mathcal{X}$ to $y_i$ in the label space $\mathcal{Y}$. Here, the input space is the data (images, audio signals, etc.) or, more typically, some interesting features (SIFT, MFCCs, etc.), and the label space is the list of possible classes. In person identification this would be the list of people’s names, while in binary classification (like face vs. not-a-face) we have just 2 possible labels.

In Supervised learning the machine is provided with many examples $x_i$ along with the class $y_i$ they belong to. It is among the simplest, most studied forms of learning, and there are many popular classifiers which can be trained. However, it is also the most prohibitive to scale up, since each new concept needs to be taught separately with many examples, and a lot of manual effort is required in obtaining the class labels $y_i$. Typical supervised learning techniques include Nearest Neighbour classifiers, Support Vector Machines (max-margin classifiers), Random Forests (decision trees), Linear Regression (and other regressors), etc. A common problem is overfitting to the training data, which is reduced by having a validation set (some data kept aside for tuning parameters).

On the opposite side is Unsupervised learning, where the data comes with no labels at all. The goal here is to automatically learn some underlying structure in the data. Most work focuses on grouping similar data (in the feature space $\mathcal{X}$) into clusters. The most popular method is K-Means, where the data is sorted into “K” clusters. The other common technique is Agglomerative clustering, which has the advantage of not requiring the number “K” to be specified. However, it does need some other stopping criterion, such as the distance between the clusters to be merged, or the tree depth. (See: Clustering and Matlab)
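For concreteness, here is a minimal Python sketch of K-Means (Lloyd’s algorithm); the function name and the random initialization scheme are my own choices.

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Plain Lloyd's algorithm: alternate between assigning points to the
    nearest centroid and recomputing centroids as cluster means."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]  # random data points
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        # Recompute centroids; keep the old one if a cluster goes empty.
        new = np.array([X[labels == c].mean(0) if np.any(labels == c)
                        else centers[c] for c in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers
```

On well-separated data the assignments stabilize in a handful of iterations, but note that the result depends on the random initialization, which is why K-Means is often restarted several times.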

The evaluation of clustering is quite tricky; one interesting way to do it is to measure the manual effort required to label the data (Read: Unsupervised Metric Learning for Face Identification in TV Video, Cinbis et al. ICCV2011). Other methods capture, in some way, the entropy of the clusters. The Adjusted Rand index also seems to be a good way to measure clustering accuracy (but the Rand index alone is bad!).

# Machine Learning

Many of the common computer and electronic technologies today (web search, speech recognition, language translation, gesture recognition, recommender systems, etc.) come under this vast area called Machine Learning. The grand goal is to develop artificial intelligence that emulates the human brain and many of its functions.

Traditionally, machine learning has been “supervised”. The machine is shown a picture of an apple and told that it is an apple, or is told that a certain language is English by being shown many example words, or that a certain audio signal is music and another is speech. However, since we want the machine to learn an increasing number of concepts in various areas, providing examples of each is too much spoon feeding! Moreover, machines, being inherently dull, require a lot of examples (say 50 to 100) to really understand what an apple is, and apples are still quite symmetric. Deformable objects like people or animals are even harder.

This has led to so many types of learning, and in the coming months I expect to write a short note on each. The technical depth will obviously increase, and most examples will pertain to images and vision. Below is an expected list:

Supervised learning; Unsupervised learning; Semi-supervised learning; Weakly supervised learning; Multiple label learning; Multiple instance learning; Multi-instance multi-label learning; Ambiguously labeled data learning; Probably Approximately Correct (PAC) learning; Transfer learning; Zero-shot learning; One-shot learning; Curriculum learning; Active learning; Proactive learning; Metric learning; Structured learning; Reinforcement learning; Consistency learning.

I am pretty sure I have missed many more learning techniques out there; let me know!