# Labeled data, Unlabeled data and Constraints

The standard paradigm of machine learning is to learn a model from some labeled “training” data $(X_l, y_l)$ and then apply this model on to some “test” data for which we wish to obtain labels. However, for most tasks where machine learning is applied, there is abundance of unlabeled data, and labeled data usually comes at a price (processing / manual annotations / sometimes even real money).

Recently there is a trend towards using unlabeled data $X_u$ in the learning step and terming the methods as semi-supervised learning. The idea here is to use the fact that there exist data points (even unlabeled ones) in that space, and that it is not just empty space.

Semi-supervised Learning decision boundary (Wiki)

The figure above captures this point very well. If we just had two labeled points (filled and empty circles), the decision boundary would be a simple straight line maximally separating the two (top half of the figure). However, the very knowledge of having more points in that banana-shaped space, allows us to form much more intricate boundaries to better classify the points. The most popular notion for semi-supervised learning is to place the decision boundary respecting the labeled points, AND in a space with the least density of points.

In our framework, we go a step further and extend this idea to include constraints between data points. These constraints typically say whether a pair of data points belongs to the same class (positive constraint) or they belong to different classes (negative constraint). This allows to further refine the boundaries. For example, in the figure below we have a 3 class problem where only the colored data points ($\bigcirc, +, \nabla$) are labeled. The others ($\times$) are all unlabeled. Compare how the decision boundaries change to satisfy the constraints and the unlabeled data.

Labeled data, Unlabeled data and Constraints

We formulate the problem using differentiable loss functions (making optimization easier and faster!) for each of the three components: labeled data, unlabeled data and constraints, and then apply it to improve the performance of face recognition in TV series. Labeled data is obtained from alignments between subtitles (what is spoken when) and transcripts (who speaks what) to obtain who speaks when. Unlabeled data is plenty! basically, all the face tracks which are not assigned a name. Currently we only use negative constraints between tracks occurring at the same time — two faces appearing at the same time should (mostly) belong to different people.

For more details, results, and analysis please refer to
M. Bäuml, M. Tapaswi, and R. Stiefelhagen. Semi-supervised Learning with Constraints for Person Identification in Multimedia Data. CVPR 2013. (download)