Manifold Learning

October 12, 2011
Machine learning has faced a strange curse of dimensionality, where it becomes difficult to compare data points when they are really long vectors. This arises from the fact that most distances are normally computed in the Euclidean space and this gets quite complicated (see explanation of hypersphere/hypercube).

There are many dimensionality reduction techniques, the most famous being Principal Component Analysis which in fact works quite well when the data points are somewhat linearly distributed. However it produces disastrous results in case of non-linear distributions.

What is a manifold? The underlying idea is that in very large dimensional spaces, data points of physical processes usually tend to occur in restricted regions, and are almost never uniformly distributed through the entire space. A manifold is an imaginary surface on which these data-points are assumed to lie. Furthermore the key concept is that this manifold has very few degrees of freedom, and hence can be represented in a low-dimensional space.

The distance between the data points is now computed as the shortest distance while tracing along the manifold. For example, consider our Earth. To go from Delhi to New York you could drill through the ground and obtain the shortest distance in 3-D Euclidean space. However this is infeasible and the true shortest distance is actually the geodesic distance, the one when traveling on the surface of Earth. Here our Earth’s surface is our manifold in the large universe!

Given the data points, the objective is to find this manifold so that data-points close to each other on the manifold surface can be classified as belonging together instead of the ones which are close in Euclidean space. However finding this exact geometric shape is truly a Herculean task in most cases. Hence, the objective is transformed to find a function which can map the data from the high dimensional space to a simpler low dimensional space, such that the distance between the points in the low dimension corresponds to the geodesic distance as computed on the manifold.

Techniques such as ISOMAP and LLE developed in 2000 created quite a stir and were published in Science!! Here is a pretty interesting tutorial on manifolds, from which the above gist appears.


Liquid Rescale

August 7, 2011
Image resizing for most cases today works directly on pixel level irrespective of the content. A first step to not distort the information is to keep the aspect ratio the same. However this too is not very good, specially if the content of focus in the image was small to begin with, thus making it even smaller. Liquid rescale is an awesome technique for content-aware image resizing and the publication can be found here. The video below demonstrates its applications, and also quickly explains the main concept.

The idea behind this is astonishingly simple. Instead of removing a straight column (row) of pixels to reduce the width (height), a so-called vertical (horizontal) seam is removed. The seam is a connected path of low energy pixels in the image. Thus, removing a seam does not alter the high energy content, which is usually semantically useful.

An image is analyzed and seams are ordered from low to high energy. The seams with lowest energy are removed first for reducing the size of the image, and subsequent seams are removed to reach the desired image size. The image can also be expanded by adding in extra pixels in these seam locations with values as the average of the neighbours. The video below throws light on how these seams look and shows nice applications.

A Gimp plugin can be found here and imagemagick incorporates it in here.


Cholesky decomposition for Matrix Inversion

July 8, 2011
Matrix inversion is a classical problem, and can be very complicated for large matrices. There are many ways to simplify this for special types of matrices. Among them, one is to transform the matrix into a set of upper or lower triangular matrices. Consider our target matrix \mathbf{A} which is Hermitian and positive-definite. Such matrices are quite famous and an example is the covariance matrix in statistics. It’s inverse is seen in the Gaussian probability density function for vectors. Then, Cholesky decomposition breaks

\mathbf{A} = \mathbf{L}\mathbf{L}^T = \mathbf{U}^T\mathbf{U}

where \mathbf{L} is a lower triangular matrix, while \mathbf{U} is an upper triangular matrix.

It is much easier to compute the inverse of a triangular matrix and there exist numerical solutions. Then the original matrix inverse is computed simply by multiplying the two inverses as

\mathbf{A}^{-1} = (\mathbf{L}^{-1})^T(\mathbf{L}^{-1}) = (\mathbf{U}^{-1})(\mathbf{U}^{-1})^T

As bonus, the determinant is also much easier to compute.

det(\mathbf{A}) = det(\mathbf{L})^2

One can also use complex matrices, and just use a conjugate-transpose instead of transpose alone.


Clustering and Matlab

June 14, 2011
Clustering data is a useful technique for compact representation (vector quantization), statistics (mean, variance of group of data) and pattern recognition(unsupervised classification). In this post, we shall briefly see the two major types of clustering techniques, and then look at how easily Matlab deals with them.

One class of the techniques is Hierarchical, usually Agglomerative clustering. As is clear from the words itself, agglomerative clustering involves grouping data points most “near” to each other. The measure of nearness is defined by a pre-defined measure of distance (usually Euclidean). A tree structure is built and we move from each data point being its own cluster to a 1-cluster system. The user then has to decide at which point should the clustering process stop. For example, one could stop the system when the next agglomeration involves combining clusters some x or more distance away. The other way would be to define a maxclusters criterion. However, as we shall see further that sort of defeats the purpose of hierarchical clustering.

In Matlab, T = clusterdata(X,'cutoffType',cutoffThreshold) does all the clustering work and returns the cluster classes in T. However more insight can be obtained by performing each task individually.

  1. Y = pdist(X,'type') computes distance between data points
  2. Z = linkage(Y,'average') gets the tree linkages matrix
  3. T = cluster(Z,'cutoffType',cutoffThreshold) does the actual clustering

The major advantage of breaking the steps is the ability to view awesome graphs, and observe the goodness of your clustering. dendrogram(Z,'colorThreshold','default') shows the tree structure with each cluster getting a unique colour. One should also try out the silhouette() and cophenet() functions.

The other major group is the standard Partition clustering. The classical algorithm is the K-Means where K clusters are formed. In this the user fixes the number of clusters at the start. Random initial cluster centers are picked (usually from the data points). The iterative process then involves assigning “nearest” points to the cluster center, followed by an update of the cluster center itself. This is repeated until convergence, usually such that the cluster assignment does not change (if data size is tractable).

In Matlab, the command [T, C] = kmeans(X,k) clusters the data points X into k groups, returning the centers C and the target assignment T. More info

The difference between the two types is obvious. Partitional requires defining the number of clusters while the Hierarchical method needs the user to specify some sort of cutoff threshold to stop the agglomeration process. One needs to decide what is the more suitable criteria for a given problem.

PS: You can try out the functions by creating some data on your own by using mean-shifted Gaussians.
X = [-2+randn(100,1); randn(100,1); 2+randn(100,1)];


Census Transform and Faces

March 31, 2011
Another transform to the bag of so many feature extraction methods, the Census Transform (CT) and Modified CT (MCT) seems to be quite an interesting way of representing images in a very local neighbourhood. The transform provides high resilience against global illumination changes and thus is quite useful for face detection across wide illumination problems. Combined with the standard AdaBoost it proves to be an effective face detector scheme. Additionally it seems to be also used in applications in image retrieval research too. Histograms of the transform too have been used for further analysis.

The concept of CT is extremely simple. Used on grayscale images, consider any pixel and its 8 neighbours. Just assign boolean values 0/1 to the pixels who have a value lower/higher than the center respectively. Scan them in row order and this generates an 8-bit stream for each pixel and is the new transform value at that pixel. The central pixel is ignored.

On the other hand, the MCT makes a small change in this by saying compare with the mean of the 3×3 block rather than the center value. It can now also use the central pixel for comparison. A similar operation provides 9 bits which are the transform value for that pixel. A different block size like 5×5 can be used too for both, however 3×3 is usually the more favoured one.

An example figure from the paper – Face Detection with Modified Census Transform is shown here and clearly shows that the vast change in global illumination or gradient doesn’t affect the local pattern much.

Face Detection with MCT

Face Detection with MCT


Markov Random Fields

March 11, 2011
Graphical models are an interesting approach to pattern recognition. Most pattern recognition problems can be simplified to computation of the posterior probability of a class, given an observation. This conditional probability is represented in terms of a graphical model. Although both directed and undirected graphs can be used, Markov Random Fields (MRFs) deal with undirected graphs.

Their main idea is quite a simple one. First, the problem is converted to a graph. An example would be to associate each random variable or observation with a node, and join edges of those nodes which seem to be related with each other. Then, cliques of this graph are considered together. The maximal cliques in the graph (taking into account each node in atleast one of them) are associated with a strictly positive potential function. The joint probability of the model is then just a normalised product of the individual clique potential functions.

Now, since we require positive potential functions, a typical example is to use the exponential function. They are used as \phi_{X_C} = e^{-E(X_C)} where E is the “energy function”. It is easy to see that the joint probability (a product of individual potentials \phi_{X_C}) is now the exponential of a sum of energies; and following the entropy concept, we have the lower energies associated with higher probabilities.

These energy functions are chosen for each clique such that the desired response has lower energy, and the undesired has higher energy. Typically they could be functions of the random variables involved in that clique. They also include “constant” parameters (weights) so that more important cliques can exert dominance on the total probability of the model. These parameters then are learnt.

MRFs seem to be a really nice way of combining multiple observations or pattern recognition tasks intended towards a certain goal. The probabilistic approach followed here can be expected to beat the heuristics that would otherwise be employed. More on algorithms for training and testing in posts to come!

The book on Pattern Recognition and Machine Learning by Bishop, provides a brief idea about Graphical Models in Chapter 8 and is provided as the sample chapter.


On Distances…

February 11, 2011
Distances are used every day in statistics and applied mathematics for comparing, seeing how far things are and also tend to be the classifying feature for several pattern recognition problems. Here’s a short overview of the distances I know or had heard of. Please help add others!

Euclidean distance: Straight line distance between any 2 points in n-dimensional Euclidean space (normally most spaces are Euclidean). Most commonly used. More info
Absolute distance (Manhattan distance): Follows the layout of a city while measuring distance. Basically the measure of distance while taking paths that are parallel to the axes. More info
p-norm distance (Minowski distance): Euclidean space distance, generalized to the order of p. (p=1: Absolute, p=2: Euclidean). Derived from the p-norm concepts. More info

Chebyshev distance: The p-norm distance for p=infinity. It picks the axis with the maximum difference between the (corresponding) points of a vector. More info
Mahalanobis distance: Similar in principle to the Euclidean distance, however takes into account the correlation between the data. Widely used in data clustering methods. More info
Algebraic distance: A popular distance metric for the Minimum Mean Square Error solutions. Represented easily with vectors and matrices, it makes this a very powerful tool. More info

Bhattacharya distance: A symmetric measure of distance used to compare probability distributions. eg. Could be used in images for comparing histograms. More info
Kullback-Leibler distance: Used typically by the information theory people, it is an asymmetric measure between probability distributions. It can simplify a lot of Info. Theory proofs :) More info
Earth Mover’s distance: Thought of as the amount of “Earth” one needs to move to make two probability distributions similar. Used typically with image histograms. More info

Hausdorff distance: Idea from set theory however, can be extended to matching shapes using edge images. It is also a measure of how similar two 3D models are in computer graphics. More info
Hamming distance: The number of different symbols while comparing two strings (typically bits). Can be used to measure bit-flips or errors in a data. More info


bsxfun

November 16, 2010
Just found this awesome function in Matlab called bsxfun. It is rather helpful, and fast too as illustrated below. Usage can be found at Matlab documentation (ofcourse!) or an example as below. It is called in a way a little similar to the late (deprecated) blkproc (or now, blockproc) function with an argument as a function. The “@fun” part.

The example here basically computes the difference, B is subtracted from each row of A.

>> A = rand(1000,3);
>> B = rand(1,3);
>> tic; C1 = bsxfun(@minus,A,B); toc;
Elapsed time is 0.003523 seconds.
>> tic; C2 = A - ones(1000,1)*B; toc;
Elapsed time is 0.090703 seconds.

Note that it performs the subtraction on the columns, (length 3). In case you specify B as B = rand(1000,1) then it will perform the subtraction on the rows (length 1000).

Another way to achieve the same operation, without using a loop, would be using repmat, but it is the slowest of them all. Ofcourse, you should NOT even think of using a loop for such a thing in Matlab!


SSH and screen

November 12, 2010
I just learnt the amazing power of screen with ssh, and put it to use for my first time, so here’s a post :) Consider you are working on some data on a remote computer/server. You can access this server, but only intermittently. Also, you dont have the patience to keep some window open to stare at the progress of your work, but would like to visit it now and then to know how much of it is complete. Screen is the answer to your problem. The work could be from running simulations (C / C++ / other Terminal executable languages) and even can be extended to Matlab! On a console-only SSH, running matlab defaults to the terminal since it cannot open a graphical interface, and this aspect is exploited here.

The most useful feature of screen is the ability to disconnect / close the terminal, without quitting the associated programs. Further, you can reconnect to this “screen” whenever you wish. The terminology used is “detach” and “attach”. Here’s how its done.

  1. SSH to your account, wherever it is…
  2. Start a session of screen by typing: screen This will auto-generate a weird session id. This can be renamed, but details later
  3. Now, start your simulation/work in this terminal.
  4. Detach: Next, you can just close this terminal window, knowing well that your work is being carried out by the server
  5. In case you want to look at the status, firstly SSH back to the account and do a screen -ls to get a list of running screens
  6. Attach: Connect to your screen by typing screen -r <name_as_in_list>, and you are back to your work!

For more details on the usage of screen, visit this nice and short tutorial.


MPEG-7 Colour: Dominant Colour Descriptor

September 7, 2010
This post deals with another prominent image / video descriptor – Colour. The Dominant Colour Descriptor of the MPEG-7 is quite a useful tool for “query by example” applications. As the name indicates, the most dominant colour in the image is presented. The colour is tuned to be as close as possible to the original.

The main use can be for image retrieval. This colour and maybe amount could be used as a “feature” of the image, and then compared with the rest of the images to check for similarity. Also there are other descriptors like the colour structure descriptor, colour layout descriptor, and the basic histograms. Combined with these, an effective feature can be formed and used for the required application.

Firstly, the standard uses the CIE L*a*b colour space. This is because the Euclidean distances are more perceptual in this domain rather than the conventional RGB. (i.e. some distance X in RGB may not change the colour equally in all directions). The MPEG-7 standard suggests the so-called top-down technique. In this strategy, we first obtain one dominant colour, as the centroid of all pixels in the image.

The algorithm then follows by splitting this cluster. The cluster with the highest distortion is divided into two clusters by adding a small perturbation vector. However, note that this vector is not just any random vector. It is computed (using eigen values and vectors of the covariance matrix) in the direction of maximum variance. Thus we obtain two new centroid locations. These are further updated by the generalized Lloyd algorithm, i.e. assign each point to cluster of closer proximity, and then re-compute the centroid for the given set of points.

Since, we pick and split the centroids with max distortion, it is possible to get any number of dominant colours for the image. Below are example images of the original image, and 3 (left top), 4 (right top), 5, 6, 7 (left bottom) and 8 (right bottom) dominant colour images (click for larger view).

Original Image

Original Image

3 to 8 Dominant Colour Image

3 to 8 Dominant Colour Image


Follow

Get every new post delivered to your Inbox.