Matlab interface with JSON data

Interfacing with different types of data is important, and JSON (for whatever reason) seems to be a popular format. For the past couple of years we have used JSONLab from the MathWorks File Exchange, like many other folks out there, and yes, it is good!

However, one of the problems with the JSON format is that it repeats the key (as in the key-value concept) for every element and is thus quite redundant. This also makes JSON files bulky, and complicated, deeply nested JSON files (lists of dictionaries of lists of dictionaries, you get the idea…) take forever to parse.

Thanks to Christian Panton, there is now a matlab-json repository on GitHub which wraps the C/C++ JSON libraries in a MEX file, and thus makes reading a JSON file very fast.

All timings are averages over 10 runs unless mentioned otherwise, and include the time taken to open and read the JSON file (both use fscanf in Matlab). For a relatively small (9.3KB) JSON file holding a simple structure array, JSONLab takes roughly 220 milliseconds, while matlab-json takes 9.6 milliseconds. That’s a massive improvement for a small file. For a 9.0MB file of nesting depth 2 (a structure array of structures), JSONLab takes 4869 seconds! (that’s about 1h20min, measured on a single run) while matlab-json takes just 1.54 seconds.

The repository includes a binary for Windows (which I haven’t tested) and is easy to compile on Linux: just install libjsoncpp-dev and libjson-c-dev and you are good to go with make.m. To maintain compatibility with JSONLab’s “loadjson”, which can directly take a filename as input, you may need to wrap matlab-json’s “fromjson” so that it first tries to read from a file.
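
A minimal sketch of such a wrapper (the function name is my own; fromjson is the matlab-json parser, and fscanf mirrors the file reading mentioned above):

function data = loadjson_fast(input)
    % Mimic JSONLab's loadjson: accept either a filename or a JSON string.
    if exist(input, 'file')
        fid = fopen(input, 'r');
        str = fscanf(fid, '%c');   % read the whole file as characters
        fclose(fid);
    else
        str = input;
    end
    data = fromjson(str);          % matlab-json MEX parser
end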

You can now use big and ugly JSON files with your Matlab.

Generate Positive and Negative Pairs given Identities

This post is about generating positive pairs of features (in-class data points) and negative pairs (out-of-class data points) for tasks such as metric learning or simply learning classifiers. We need just 7 lines of Matlab code to generate these pairs, relying on the concepts of lower / upper triangular matrices and block diagonal matrices.

So here’s the code! Given a list of ground truth identities ids, we first generate the unique identities uids, mainly to know the number of classes. For each class, we count the number of data points that belong to it, sum(uids(k) == ids), and generate a matrix of all ones.

To find negative pairs, we create a block diagonal matrix from the all-ones matrices of each class, and then logically negate it to get a 1 at the positions of out-of-class pairs and a 0 for in-class pairs. Taking the lower triangular part after this simply removes the duplicate pairs: tril(~blkdiag(onesmats{:})).

For the positive pairs, we need a 1 at the intra-class positions, and to remove duplicates we again use the upper triangular matrix and negate it, thus keeping a lower triangular matrix without the diagonal 😉 ~triu(onesmats{k}). All positive pairs are then found by concatenating these matrices as block diagonals.

uids = unique(ids);                           % unique class identities
for k = 1:length(uids)
    onesmats{k} = ones(sum(uids(k) == ids));  % all-ones block, one per class
    pospairmat{k} = ~triu(onesmats{k});       % in-class pairs, duplicates removed
end
[pospairs(:, 1), pospairs(:, 2)] = find(blkdiag(pospairmat{:}));
[negpairs(:, 1), negpairs(:, 2)] = find(tril(~blkdiag(onesmats{:})));

Finally, a simple [x, y] = find() gives us the row and column indices that form the pairs. Suggestions for shorter, faster and smarter techniques are more than welcome in the comments.

PS: The code assumes that the ids are sorted, i.e. features of the same class occur next to each other. If this is not the case, it is easy to fix by running a sort and re-indexing the feature vectors.
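
For example, assuming the features are stored as rows of a matrix feats (a name used here purely for illustration), the sort and re-indexing boils down to:

[ids, order] = sort(ids);    % sort identities so same-class points are adjacent
feats = feats(order, :);     % re-index the feature vectors accordingly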

On Norms…

The mathematical norm of a vector (or a matrix) captures interesting properties of that vector (matrix), and minimizing it together with other loss function terms or constraints yields solutions with correspondingly interesting properties.

Vector Norms
The classical vector norms are \ell_0, \ell_1, \ell_2, \ell_\infty and all of them have some very interesting properties.

L2 norm: Probably the most common among them is the \ell_2 norm which is used in regularization or normalization. Minimization of the \ell_2 norm of a vector, usually a weight vector in classifiers such as SVMs or linear / logistic regression, prevents imbalance and makes sure that no single value is very high compared to the others. The resulting minimization problems are differentiable, convex and basically nice!

L-Inf norm: The \ell_\infty norm of a vector is the maximum absolute value of the vector and thus its minimization imposes limits on the range of numbers a vector can take.

L0 norm: The \ell_0 norm counts the number of non-zero elements in a vector. This norm is almost always used in the mathematical foundations of sparsity-related problems. However, the \ell_0 minimization problem is hard to solve (in fact, NP-hard) and is thus very often relaxed to the \ell_1 norm.

L1 norm: The \ell_1 norm sums up the absolute values of the vector. While minimizing it directly reduces the overall absolute sum of the vector, it is most often used as a surrogate for inducing sparsity, since values very close to zero (how close depends on the problem) can often be ignored. There are algorithms for such sparse recovery problems, e.g. greedy ones like Matching Pursuit or convex formulations like Basis Pursuit.
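
For reference, for a vector x \in \mathbb{R}^n these norms are

\|x\|_0 = \#\{i : x_i \neq 0\}, \quad \|x\|_1 = \sum_i |x_i|, \quad \|x\|_2 = \sqrt{\sum_i x_i^2}, \quad \|x\|_\infty = \max_i |x_i|

and a typical \ell_2-regularized learning objective for a weight vector \mathbf{w} (with L the hinge loss for SVMs, or the logistic loss) looks like

\min_{\mathbf{w}} \sum_{i=1}^{N} L(y_i, \mathbf{w}^\top \mathbf{x}_i) + \lambda \|\mathbf{w}\|_2^2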

Matrix Norms
While most vector norms can be directly extended to matrices, there are special matrix norms which present interesting properties.

Entry-wise norms: These basically treat the matrix as a set of vectors stacked together for convenience and simply apply the standard vector norms entry-wise. In general, they have no special properties.

L2 norm: Also referred to as the Frobenius norm \|A\|_F, it can be computed as an entry-wise norm. However, it has more interpretations: it can also be obtained as the square root of the trace of the matrix conjugate product, \sqrt{\mathrm{trace}(A^*A)} (always nice to have a trace, since it is invariant to cyclic permutations of matrix products); or via the singular values, by computing the \ell_2 norm of a “vector” made up of the singular values. Its minimization has a similar significance as in the vector case and results in convex problems.

Nuclear norm: Denoted as \|A\|_*, the nuclear norm computes the equivalent of the \ell_1 norm on the “vector” of singular values. Its minimization induces sparsity (zero-valuedness) amongst the singular values, which in turn amounts to minimizing the matrix rank. A non-full-rank matrix is pretty interesting and can be used for dimensionality reduction via projections, for clustering by grouping data points into a few rows, etc.

Chaining norms: Operating on the row or column vectors of a matrix, these mixed norms are denoted as \|A\|_{p,q}. Here we first apply the \ell_q norm on each row of the matrix, followed by an \ell_p norm on the resulting vector. Minimization can be used to obtain interesting combinations of properties, such as row sparsity with \ell_2 regularization within each row when p = 1, q = 2.
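
In terms of the singular values \sigma_i of A, the two named norms above can be written compactly as

\|A\|_F = \sqrt{\mathrm{trace}(A^* A)} = \sqrt{\sum_i \sigma_i^2}, \qquad \|A\|_* = \sum_i \sigma_i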

Please add more interesting norms / properties that I may have missed in the comments section!

Bézier curves off the Diagonal

We all know that we can drag the center point of a diagonal line in a plot and move it around to make beautiful arcs. But how are these curves mathematically generated, and can they be easily controlled by a single parameter?

Bézier curves for \alpha parameter values 0, 0.25, 0.5, 0.75 and 1

The answer to the second question is yes, and the first is answered below. The curves above are essentially quadratic Bézier curves with three control points. The start and end points are easy to pick: in the examples above they are (1, 1) and (100, 100), and in general, for a row count N_R and column count N_C, the end point can be (N_C, N_R) — points denoted in the (x, y) convention, i.e. columns first.

Now that we have two points, we can talk about the third control point. When this is the mid-point of the space, (0.5 \cdot N_C, 0.5 \cdot N_R), the curve is a straight diagonal line since all three points are collinear. However, adding an imbalance gives deviations from the diagonal. For simplicity, these deviations can be parametrized by a single variable \alpha, which places the third point at (\alpha N_C, (1-\alpha) N_R).

The curves are instances of rational Bézier curves with three control points, which have the additional property of being able to represent conic sections

\mathbf{B}(t) = \frac{\sum_{i=0}^2 {2 \choose i} t^i (1-t)^{2-i} w_i \mathbf{P}_{i} } { \sum_{i=0}^2 {2 \choose i} t^i (1-t)^{2-i} w_i }

where the w_i are scalar weights and the value t sweeps from 0 to 1 in small steps to give the (x, y) coordinates at each step. With all w_i = 1 (as in the curves above), the denominator sums to 1 and we recover the ordinary quadratic Bézier curve.
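
As an illustration, here is a small Matlab sketch (with unit weights, so the rational form reduces to the plain quadratic Bézier) that traces the curve for a given N_R, N_C and \alpha:

NR = 100; NC = 100; alpha = 0.65;             % grid size and bend parameter
P = [1, 1; alpha*NC, (1-alpha)*NR; NC, NR];   % control points as rows (x, y)
t = linspace(0, 1, 200)';                     % parameter sweep from 0 to 1
Bx = (1-t).^2 * P(1,1) + 2*t.*(1-t) * P(2,1) + t.^2 * P(3,1);
By = (1-t).^2 * P(1,2) + 2*t.*(1-t) * P(2,2) + t.^2 * P(3,2);
plot(Bx, By); axis([1 NC 1 NR]);              % arcs from (1,1) to (NC,NR)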

An application of this curve (my use-case) is to model the number of sentences (space) used to describe a movie / TV episode in a plot synopsis (plotted on the y-axis) corresponding to the shots (plotted on the x-axis). Empirical evidence suggests that authors spend more space describing the latter half of a video as it approaches the climax, thus allowing us to use \alpha \sim 0.65 as a model which fits better than the diagonal.

Affinity Diffusion

This post is based on a very simple, yet powerful technique presented at CVPR 2013. The authors essentially compare many existing graph-based diffusion techniques, generalize them and evaluate them to pick the best alternative. The goal — to improve the quality of affinity matrices.

M. Donoser and H. Bischof. Diffusion Processes for Retrieval Revisited. CVPR 2013. [paper]

Given N data points x_1, \ldots, x_N which can represent anything (objects, faces, actions, words, audio segments, etc.) we first build a pair-wise affinity matrix using an appropriate function a_{ij} = f(x_i, x_j) that captures their similarity. This affinity matrix (or its counterpart, the distance matrix, which captures pair-wise differences / dissimilarities) is very commonly used in hierarchical clustering or retrieval applications. In fact, it can also be seen as a kernel matrix where the function f is the kernel function.
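
The paper does not prescribe a particular f; one common choice (an assumption on my part) is a Gaussian kernel on pairwise Euclidean distances between the rows of an N x d data matrix X:

D = pdist2(X, X);                % pairwise distances (Statistics Toolbox)
sigma = median(D(:));            % a simple bandwidth heuristic
A = exp(-D.^2 / (2 * sigma^2));  % N x N affinity matrix, a_ij = f(x_i, x_j)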

Since many tasks such as clustering / retrieval / classification are based on this matrix, improving its quality, i.e. making similarity between elements from the same class higher and between different classes lower, can help improve performance drastically. That is precisely what this paper demonstrates, and although the MPEG-7 bulls-eye score may be a controversial measure to capture this performance improvement, an improvement in mAP is also observed.

So what is the magic that improves the affinity? It boils down to “elements similar to my neighbors are probably also similar to me”. Neighbors (in the nearest-neighbor sense) are the elements most similar to a selected element. Mathematically, it essentially involves iterating a few (5-10) matrix multiplications. The combination the authors pick as best is to initialize the iterations with P_{kNN}, a sparse k-NN matrix obtained from the original affinity matrix, and to update the affinity by pre- and post-multiplying with it until convergence. The update step is
A^{t+1} = P_{kNN} A^t P_{kNN}^T.
That’s it. The resulting converged matrix is then used for the application and shows improved performance.
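
A minimal sketch of the procedure, assuming a symmetric affinity matrix A where larger values mean more similar, and free parameters k and niter (the authors' code may differ in details such as normalization):

function Ad = diffuse_affinity(A, k, niter)
    % Sparsify A by keeping only the k largest affinities per row.
    N = size(A, 1);
    [~, idx] = sort(A, 2, 'descend');
    P = zeros(N);
    for i = 1:N
        P(i, idx(i, 1:k)) = A(i, idx(i, 1:k));
    end
    Ad = P;              % initialize with the kNN affinity matrix
    for t = 1:niter      % a handful of iterations (5-10) usually suffice
        Ad = P * Ad * P';
    end
end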

Apart from the 3 data sets where the authors show that the above works, we also quickly tried this algorithm on face-based person retrieval and saw an improvement in mAP from 0.463 to 0.648! That is huge!

The authors also provide easy-to-use Matlab code, so you can replicate their results and apply it to your own problem.

Jinja to Visualize Shot Threads and Scenes

Jinja2 (docs) – a template rendering system for Python – is quite useful for visualizing results in a quick and consistent way. In this post we have some images (frames of a video) which need to be presented in a repetitive structure. In case you do your research work in Matlab this is also easy, since you can export your data as a mat file and load it in Python with scipy.io.loadmat() (follow these instructions to make it easier).

The main idea is:
import jinja2
from scipy.io import loadmat

template = jinja2.Template(open("path_template").read())
data = loadmat("path_to_matfile")      # dict mapping variable names to arrays
with open("output_file", "w") as out:
    out.write(template.render(**data))

In my case of working with TV series and movies, one popular problem is shot threading. TV series and movies are typically shot from multiple cameras, and are heavily edited. Reverse engineering the editing can help put together shots from the same camera, which then goes on to help tasks such as person identification (they might be the same person!).

Threaded shots from different cameras capturing a scene

Visual analysis of the threading results can be quite hard due to the massive scale of the problem (imagine 700-1000 shots per 40min episode, and 20+ such episodes in a season). In the example above, we use Jinja to generate an HTML file with tables and links to the images. We see that the shot number below the image is staggered and indicates the editing pattern (click to view larger image and actually see the numbers).

Another popular problem is to divide the video into scenes, typically defined as a set of shots at the same location with the same set of people. Again, visual analysis of the scene break detection algorithm lends itself very well to a template rendering system.

Scene break detection (5 shots per row)

We see a correctly inserted scene break here (yaay!) as the characters in the story have changed.

Just as a final note, Jinja can do any form of template rendering. It is not restricted to HTML. Automatically generated LaTeX and then PDF files are certainly something to try 🙂

Labeled data, Unlabeled data and Constraints

The standard paradigm of machine learning is to learn a model from some labeled “training” data (X_l, y_l) and then apply this model to some “test” data for which we wish to obtain labels. However, for most tasks where machine learning is applied, there is an abundance of unlabeled data, while labeled data usually comes at a price (processing / manual annotation / sometimes even real money).

Recently there has been a trend towards using unlabeled data X_u in the learning step; such methods are termed semi-supervised learning. The idea is to use the fact that data points (even unlabeled ones) exist in that part of the space, and that it is not just empty space.

Semi-supervised learning decision boundary (Wiki)

The figure above captures this point very well. If we just had two labeled points (filled and empty circles), the decision boundary would be a simple straight line maximally separating the two (top half of the figure). However, the very knowledge that there are more points in that banana-shaped space allows us to form much more intricate boundaries to better classify the points. The most popular notion in semi-supervised learning is to place the decision boundary such that it respects the labeled points AND passes through a region with the lowest density of points.

In our framework, we go a step further and extend this idea to include constraints between data points. These constraints typically state whether a pair of data points belongs to the same class (positive constraint) or to different classes (negative constraint). This allows us to further refine the boundaries. For example, in the figure below we have a 3-class problem where only the colored data points (\bigcirc, +, \nabla) are labeled. The others (\times) are all unlabeled. Compare how the decision boundaries change to satisfy the constraints and the unlabeled data.

Labeled data, unlabeled data and constraints

We formulate the problem using differentiable loss functions (making optimization easier and faster!) for each of the three components: labeled data, unlabeled data and constraints, and then apply it to improve the performance of face recognition in TV series. Labeled data is obtained from aligning subtitles (what is spoken when) with transcripts (who speaks what) to obtain who speaks when. Unlabeled data is plentiful: basically all the face tracks which are not assigned a name. Currently we only use negative constraints between tracks occurring at the same time — two faces appearing at the same time should (mostly) belong to different people.

For more details, results, and analysis please refer to
M. Bäuml, M. Tapaswi, and R. Stiefelhagen. Semi-supervised Learning with Constraints for Person Identification in Multimedia Data. CVPR 2013. (download)