# Bézier curves off the Diagonal

We all know that we can drag the center point of a diagonal in a plot and then move it around to make beautiful arcs. However, how are they mathematically generated and can they be easily controlled by a single parameter?

*Bézier curves for $\alpha$ = 0, 0.25, 0.5, 0.75 and 1*

The answer to both questions follows, and yes, a single parameter suffices! The curves above are essentially quadratic Bézier curves with three control points. All of them share easy-to-pick start and end points: in the examples above we use (1, 1) and (100, 100), but for a general row count $N_R$ and column count $N_C$ the end point can be $(N_C, N_R)$, with points denoted in the $(x, y)$ convention, i.e. columns first.

Now that we have two points, we can talk about the third control point. When this is the mid-point of the space, $(0.5 \cdot N_C, 0.5 \cdot N_R)$, the curve is a straight diagonal line since all three points are collinear. However, add an imbalance and we get deviations from the diagonal. For simplicity, these deviations can be parametrized by a single variable $\alpha$, which places the third point at $(\alpha N_C, (1-\alpha) N_R)$.

The equations of the curves correspond to rational Bézier curves with three points, which have the additional property of also representing conic sections:

$\mathbf{B}(t) = \frac{\sum_{i=0}^2 {2 \choose i} t^i (1-t)^{2-i}\, w_i \mathbf{P}_{i} } { \sum_{i=0}^2 {2 \choose i} t^i (1-t)^{2-i}\, w_i }$

where the weights $w_i$ select the type of conic (with unit weights the denominator sums to 1 and the curve reduces to an ordinary quadratic Bézier), and the value $t$ sweeps from 0 to 1 in small steps to give the $(x, y)$ coordinates at each step.
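As a sketch of the construction, here is the curve sampled with unit weights, i.e. the ordinary quadratic Bézier (pure Python; the function name and the step count are my own choices):

```python
# Quadratic Bezier from (1, 1) to (N_C, N_R), bent off the diagonal
# by the single parameter alpha via the middle control point.

def bezier_points(n_c, n_r, alpha, steps=100):
    """Sample the quadratic Bezier with control points
    P0 = (1, 1), P1 = (alpha*N_C, (1-alpha)*N_R), P2 = (N_C, N_R)."""
    p0 = (1.0, 1.0)
    p1 = (alpha * n_c, (1.0 - alpha) * n_r)
    p2 = (float(n_c), float(n_r))
    pts = []
    for k in range(steps + 1):
        t = k / steps
        # Bernstein basis: (1-t)^2, 2t(1-t), t^2; with unit weights the
        # rational form reduces to this ordinary quadratic Bezier.
        b0, b1, b2 = (1 - t) ** 2, 2 * t * (1 - t), t ** 2
        x = b0 * p0[0] + b1 * p1[0] + b2 * p2[0]
        y = b0 * p0[1] + b1 * p1[1] + b2 * p2[1]
        pts.append((x, y))
    return pts

curve = bezier_points(100, 100, alpha=0.65)
```

With $\alpha = 0.5$ all three control points are collinear and every sampled point lies on the diagonal $x = y$; other values of $\alpha$ bow the arc to one side.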

An application of this curve (my use case) is to model the number of sentences (space) used to describe a movie / TV episode in a plot synopsis (plotted on the y-axis) against the corresponding shots (plotted on the x-axis). Empirical evidence suggests that authors spend more space describing the latter half of a video as it approaches the climax, so $\alpha \approx 0.65$ gives a model which fits better than the diagonal.

# Affinity Diffusion

This post is based on a very simple, yet powerful technique presented at CVPR 2013. The authors essentially compare many existing graph-based diffusion techniques, generalize them and evaluate them to pick the best alternative. The goal — to improve the quality of affinity matrices.

M. Donoser and H. Bischof. Diffusion Processes for Retrieval Revisited. CVPR 2013. [paper]

Given $N$ data points $x_1, \ldots, x_N$, which can represent anything (objects, faces, actions, words, audio segments, etc.), we first build a pair-wise affinity matrix using an appropriate function $a_{ij} = f(x_i, x_j)$ that captures their similarity. This affinity matrix (or its counterpart, the distance matrix, which captures pair-wise dissimilarities) is very commonly used in hierarchical clustering and retrieval applications. In fact, it can also be seen as a kernel matrix where the function $f$ is the kernel function.

Since many tasks such as clustering / retrieval / classification are based on this matrix, improving its quality, i.e. making similarity between elements from the same class higher and between different classes lower, can help improve performance drastically. That is precisely what this paper demonstrates, and although the MPEG-7 bulls-eye score may be a controversial measure to capture this performance improvement, an improvement in mAP is also observed.

So what is the magic that improves the affinity? It boils down to “elements similar to my neighbors are probably also similar to me”. Neighbors (in the nearest-neighbor sense) are elements that are similar to a selected element. In math, it essentially involves iterating a few matrix multiplications (5-10 times). The best combination the authors pick is to initialize the iterations with $P_{kNN}$, a sparse k-NN matrix obtained from the original affinity matrix, and to update the affinity by pre- and post-multiplying with it until convergence. The update step is
$A^{t+1} = P_{kNN} \, A^t \, P_{kNN}^T$.
That’s it. The resulting converged matrix is then used for the application and shows improved performance.
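A minimal pure-Python sketch of this update, assuming a simple row-wise k-NN sparsification (the function names and tie-breaking are my own; the authors' Matlab code is the authoritative version):

```python
# Diffusion sketch: A_{t+1} = P * A_t * P^T, where P keeps only each
# row's k largest affinities from the original affinity matrix A.

def knn_sparsify(a, k):
    """Keep each row's k largest affinities, zero out the rest."""
    n = len(a)
    p = [[0.0] * n for _ in range(n)]
    for i in range(n):
        top = sorted(range(n), key=lambda j: a[i][j], reverse=True)[:k]
        for j in top:
            p[i][j] = a[i][j]
    return p

def matmul(x, y):
    n, m, q = len(x), len(y), len(y[0])
    return [[sum(x[i][l] * y[l][j] for l in range(m)) for j in range(q)]
            for i in range(n)]

def transpose(x):
    return [list(row) for row in zip(*x)]

def diffuse(a, k=2, iters=5):
    p = knn_sparsify(a, k)
    pt = transpose(p)
    for _ in range(iters):
        a = matmul(matmul(p, a), pt)  # pre- and post-multiply by P_kNN
    return a
```

Note that if $A$ starts out symmetric, every update $P A P^T$ keeps it symmetric, so the result is still a valid affinity matrix.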

Apart from the 3 data sets where the authors show that the above works, we also quickly tried this algorithm on face-based person retrieval and saw an improvement in mAP from 0.463 to 0.648! That is huge!

The authors also provide easy-to-use Matlab code to replicate their results and to try on your own problem.

# Labeled data, Unlabeled data and Constraints

The standard paradigm of machine learning is to learn a model from some labeled “training” data $(X_l, y_l)$ and then apply this model to some “test” data for which we wish to obtain labels. However, for most tasks where machine learning is applied, there is an abundance of unlabeled data, while labeled data usually comes at a price (processing / manual annotation / sometimes even real money).

Recently, there has been a trend towards using unlabeled data $X_u$ in the learning step; such methods are termed semi-supervised learning. The idea is to exploit the fact that data points (even unlabeled ones) exist in that space, and that it is not just empty space.

*Semi-supervised learning decision boundary (Wiki)*

The figure above captures this point very well. If we just had two labeled points (filled and empty circles), the decision boundary would be a simple straight line maximally separating the two (top half of the figure). However, the very knowledge that more points exist in that banana-shaped region allows us to form much more intricate boundaries that better classify the points. The most popular notion in semi-supervised learning is to place the decision boundary so that it respects the labeled points AND passes through the region with the lowest density of points.

In our framework, we go a step further and extend this idea to include constraints between data points. These constraints typically say whether a pair of data points belongs to the same class (positive constraint) or to different classes (negative constraint). This allows us to further refine the boundaries. For example, in the figure below we have a 3-class problem where only the colored data points ($\bigcirc, +, \nabla$) are labeled. The others ($\times$) are all unlabeled. Compare how the decision boundaries change to satisfy the constraints and the unlabeled data.

*Labeled data, unlabeled data and constraints*

We formulate the problem using differentiable loss functions (making optimization easier and faster!) for each of the three components: labeled data, unlabeled data and constraints, and then apply it to improve the performance of face recognition in TV series. Labeled data is obtained by aligning subtitles (what is spoken when) with transcripts (who speaks what) to obtain who speaks when. Unlabeled data is plentiful: basically, all the face tracks which are not assigned a name. Currently we only use negative constraints between tracks occurring at the same time, since two faces appearing at the same time should (mostly) belong to different people.

For more details, results, and analysis please refer to
M. Bäuml, M. Tapaswi, and R. Stiefelhagen. Semi-supervised Learning with Constraints for Person Identification in Multimedia Data. CVPR 2013. (download)

# Intuition behind Average Precision and MAP

Average Precision (AP), more commonly averaged over all queries and reported as a single score, Mean Average Precision (MAP), is a very popular performance measure in information retrieval. However, the scores one sees are quite tricky to interpret and compare. From my observations, most hard challenges (TRECVID Semantic Indexing / Genre Classification) have very low MAP scores (0.05 to 0.3, out of a maximum of 1).

This post is my view on what the score conveys. I will also try to interpret what is a “good” MAP score, and what is “bad”. While this depends on the application, it is still good to have an idea of what to expect.

So first, what is AP (and MAP)? Suppose we are searching for images of a flower and we provide our image retrieval system a sample picture of a rose (the query); we get back a bunch of ranked images (from most likely to least likely). Usually not all of them are correct, so we compute the precision at every correctly returned image and then take the average. If our returned result is
1, 0, 0, 1, 1, 1
where 1 denotes an image of a flower and 0 one that is not, then the precision at every correct point is the number of correct images encountered up to that point (including the current one) divided by the total number of images seen up to that point:
1/1, 0, 0, 2/4, 3/5, 4/6

The AP for the above example is $(1/1 + 2/4 + 3/5 + 4/6)/4 \approx 0.6917$. So although 4 of the 6 returned images are correct, the two early mistakes pull the AP noticeably below 1. This is precisely what makes the score somewhat hard to interpret.
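The averaging described above fits in a few lines of code (the function name is my own):

```python
# Average Precision: precision at each correct hit, averaged over hits.

def average_precision(ranked):
    """ranked: list of 0/1 relevance flags in retrieval order."""
    hits, precisions = 0, []
    for i, rel in enumerate(ranked, start=1):
        if rel:
            hits += 1
            precisions.append(hits / i)  # precision at this correct result
    return sum(precisions) / len(precisions) if precisions else 0.0

ap = average_precision([1, 0, 0, 1, 1, 1])  # ≈ 0.6917
```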

A simple way to interpret is to produce a combination of zeros and ones which will give the required AP.
For example, an AP of 0.5 could have results like
0, 1, 0, 1, 0, 1, ...
where every second image is correct, while an AP of 0.333 has
0, 0, 1, 0, 0, 1, 0, 0, 1, ...
where every third image is correct.

Hopefully, it is clear by now that for an AP of 0.1, every 10th image will be correct, and that is definitely a bad retrieval system. On the other hand, for an AP above 0.5, we will encounter more correct images than wrong in the top results which is definitely a good sign.

MAP is just an extension, where the mean is taken across all AP scores for many queries, and again the above interpretation should hold true. Does your system have a MAP > 0.5? 🙂

Further, I would consider an improvement in MAP score from 0.2 (every 5th) to 0.4 (every 2nd/3rd) as much more significant than an improvement from 0.7 to 0.9 when the results are already pretty good.

While the possibility of estimating a MAP score from (multi-class) classification accuracy still remains a mystery to me, any other interpretations are more than welcome!

UPDATE 1: Another measure, the Mean Reciprocal Rank (MRR), is equivalent to MAP when the number of correct elements in the returned list is exactly 1. In that case, the AP reduces to the precision at that single point, i.e. the reciprocal of the correct result's rank, averaged across all queries. Thanks to zhangmeng1010 for the tip.

# Manifold Learning

Machine learning has long faced the strange curse of dimensionality: it becomes difficult to compare data points when they are very long vectors. This arises from the fact that most distances are computed in Euclidean space, whose geometry behaves counter-intuitively in high dimensions (see the classic hypersphere/hypercube explanation).

There are many dimensionality reduction techniques, the most famous being Principal Component Analysis (PCA), which in fact works quite well when the data points are roughly linearly distributed. However, it produces disastrous results for non-linear distributions.

What is a manifold? The underlying idea is that in very large dimensional spaces, data points of physical processes usually tend to occur in restricted regions, and are almost never uniformly distributed through the entire space. A manifold is an imaginary surface on which these data-points are assumed to lie. Furthermore the key concept is that this manifold has very few degrees of freedom, and hence can be represented in a low-dimensional space.

The distance between the data points is now computed as the shortest distance while tracing along the manifold. For example, consider our Earth. To go from Delhi to New York you could drill through the ground and obtain the shortest distance in 3-D Euclidean space. However this is infeasible and the true shortest distance is actually the geodesic distance, the one when traveling on the surface of Earth. Here our Earth’s surface is our manifold in the large universe!

Given the data points, the objective is to find this manifold so that data-points close to each other on the manifold surface can be classified as belonging together instead of the ones which are close in Euclidean space. However finding this exact geometric shape is truly a Herculean task in most cases. Hence, the objective is transformed to find a function which can map the data from the high dimensional space to a simpler low dimensional space, such that the distance between the points in the low dimension corresponds to the geodesic distance as computed on the manifold.
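The geodesic idea can be approximated by connecting each point to its nearest neighbours and running shortest paths over the resulting graph, which is essentially the first stage of ISOMAP (a pure-Python sketch with my own names; real implementations use faster shortest-path algorithms than Floyd-Warshall):

```python
# Approximate geodesic distances: build a k-NN graph, then take
# shortest paths along it instead of straight-line Euclidean hops.
import math

def geodesic_distances(points, k=2):
    n = len(points)
    INF = float("inf")
    d = [[INF] * n for _ in range(n)]
    for i in range(n):
        d[i][i] = 0.0
        # k nearest neighbours of point i (index 0 is the point itself)
        nbrs = sorted(range(n),
                      key=lambda j: math.dist(points[i], points[j]))[1:k + 1]
        for j in nbrs:
            w = math.dist(points[i], points[j])
            d[i][j] = min(d[i][j], w)
            d[j][i] = min(d[j][i], w)
    # Floyd-Warshall: shortest paths along the manifold graph
    for m in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][m] + d[m][j] < d[i][j]:
                    d[i][j] = d[i][m] + d[m][j]
    return d
```

For three points forming a bend, the graph distance between the endpoints travels through the middle point and comes out longer than the straight Euclidean shortcut, exactly the Delhi-to-New-York effect.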

Techniques such as ISOMAP and LLE, developed in 2000, created quite a stir and were published in Science! Here is a pretty interesting tutorial on manifolds, from which the gist above is drawn.

# Clustering and Matlab

Clustering data is a useful technique for compact representation (vector quantization), statistics (mean and variance of groups of data) and pattern recognition (unsupervised classification). In this post, we shall briefly see the two major types of clustering techniques, and then look at how easily Matlab deals with them.

One class of techniques is hierarchical, usually agglomerative, clustering. As the name suggests, agglomerative clustering involves grouping the data points “nearest” to each other. Nearness is defined by a pre-defined distance measure (usually Euclidean). A tree structure is built, moving from each data point being its own cluster to a single-cluster system. The user then has to decide at which point the clustering process should stop. For example, one could stop the system when the next agglomeration would combine clusters some distance x or more apart. The other way would be to define a maximum-clusters criterion; however, as we shall see, that sort of defeats the purpose of hierarchical clustering.

In Matlab, `T = clusterdata(X,'cutoffType',cutoffThreshold)` does all the clustering work and returns the cluster classes in T. However, more insight can be obtained by performing each task individually.

1. `Y = pdist(X,'type')` computes distances between data points
2. `Z = linkage(Y,'average')` gets the tree linkage matrix
3. `T = cluster(Z,'cutoffType',cutoffThreshold)` does the actual clustering

The major advantage of breaking up the steps is the ability to view awesome graphs and observe the goodness of your clustering. `dendrogram(Z,'colorThreshold','default')` shows the tree structure with each cluster getting a unique colour. One should also try out the `silhouette()` and `cophenet()` functions.
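Outside Matlab, the agglomerate-until-cutoff idea can be sketched in a few lines; this toy single-linkage version on 1-D data is illustrative only (the names and stopping rule are my own phrasing of the process above):

```python
# Agglomerative clustering sketch: repeatedly merge the two closest
# clusters until the next merge would exceed the distance cutoff.

def agglomerate(xs, cutoff):
    clusters = [[x] for x in xs]

    def single_link(a, b):
        # single linkage: distance between the closest pair of members
        return min(abs(p - q) for p in a for q in b)

    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dist = single_link(clusters[i], clusters[j])
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        if best[0] > cutoff:
            break  # stopping criterion: next agglomeration is too far
        _, i, j = best
        clusters[i] += clusters.pop(j)
    return clusters
```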

The other major group is the standard partitional clustering. The classical algorithm is K-Means, where K clusters are formed. Here, the user fixes the number of clusters at the start. Random initial cluster centers are picked (usually from among the data points). The iterative process then alternates between assigning points to the “nearest” cluster center and updating each center itself. This is repeated until convergence, usually until the cluster assignment no longer changes (if the data size is tractable).

In Matlab, the command `[T, C] = kmeans(X,k)` clusters the data points X into k groups, returning the centers C and the assignment T. More info
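The assign-then-update loop described above can be sketched in pure Python for 1-D data (a toy version; Matlab's `kmeans` adds smarter initialization and handles empty clusters more carefully):

```python
# K-Means sketch: assign each point to the nearest center, then move
# each center to the mean of its assigned points, until convergence.

def kmeans_1d(xs, centers, iters=20):
    for _ in range(iters):
        # assignment step
        groups = [[] for _ in centers]
        for x in xs:
            idx = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
            groups[idx].append(x)
        # update step (keep a center in place if it lost all its points)
        new = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
        if new == centers:  # converged: centers stopped moving
            break
        centers = new
    return centers
```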

The difference between the two types is obvious. Partitional requires defining the number of clusters while the Hierarchical method needs the user to specify some sort of cutoff threshold to stop the agglomeration process. One needs to decide what is the more suitable criteria for a given problem.

PS: You can try out the functions by creating some data of your own using mean-shifted Gaussians:
`X = [-2+randn(100,1); randn(100,1); 2+randn(100,1)];`

# Markov Random Fields

Graphical models are an interesting approach to pattern recognition. Most pattern recognition problems can be simplified to computation of the posterior probability of a class, given an observation. This conditional probability is represented in terms of a graphical model. Although both directed and undirected graphs can be used, Markov Random Fields (MRFs) deal with undirected graphs.

Their main idea is quite simple. First, the problem is converted to a graph. An example would be to associate each random variable or observation with a node, and join with edges those nodes which seem related to each other. Then, cliques of this graph are considered together. The maximal cliques in the graph (such that each node is covered by at least one of them) are each associated with a strictly positive potential function. The joint probability of the model is then just a normalised product of the individual clique potential functions.

Now, since we require positive potential functions, a typical choice is the exponential. The potentials are used as $\phi_{X_C} = e^{-E(X_C)}$, where $E$ is the “energy function”. It is easy to see that the joint probability (a product of individual potentials $\phi_{X_C}$) is now the exponential of a sum of energies; following the Boltzmann-distribution idea, lower energies are associated with higher probabilities.

These energy functions are chosen for each clique such that the desired configuration has lower energy, and the undesired one higher. Typically they are functions of the random variables involved in that clique. They also include “constant” parameters (weights) so that more important cliques can exert more influence on the total probability of the model. These parameters are then learnt.
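A toy instance of this construction, with two binary variables, a single clique, and a made-up energy that prefers agreement (all names and numbers here are purely for illustration):

```python
# Joint probability from clique energies: p(x) ∝ exp(-E(x)),
# normalised by the partition function Z.
import math
from itertools import product

def energy(x1, x2, w=1.0):
    # lower energy (hence higher probability) when the variables agree;
    # w is the learnable weight controlling how strongly the clique matters
    return 0.0 if x1 == x2 else w

def joint(w=1.0):
    unnorm = {s: math.exp(-energy(*s, w=w)) for s in product([0, 1], repeat=2)}
    z = sum(unnorm.values())  # partition function (normaliser)
    return {s: v / z for s, v in unnorm.items()}

p = joint(w=2.0)
```

Raising `w` makes disagreement ever more improbable, which is exactly how the weights let one clique dominate the model's total probability.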

MRFs seem to be a really nice way of combining multiple observations or pattern recognition tasks intended towards a certain goal. The probabilistic approach followed here can be expected to beat the heuristics that would otherwise be employed. More on algorithms for training and testing in posts to come!

The book *Pattern Recognition and Machine Learning* by Bishop provides a brief introduction to graphical models in Chapter 8, which is available as the sample chapter.