Clustering and Matlab

Clustering data is a useful technique for compact representation (vector quantization), statistics (mean, variance of group of data) and pattern recognition(unsupervised classification). In this post, we shall briefly see the two major types of clustering techniques, and then look at how easily Matlab deals with them.

One class of the techniques is Hierarchical, usually Agglomerative clustering. As is clear from the words itself, agglomerative clustering involves grouping data points most “near” to each other. The measure of nearness is defined by a pre-defined measure of distance (usually Euclidean). A tree structure is built and we move from each data point being its own cluster to a 1-cluster system. The user then has to decide at which point should the clustering process stop. For example, one could stop the system when the next agglomeration involves combining clusters some x or more distance away. The other way would be to define a maxclusters criterion. However, as we shall see further that sort of defeats the purpose of hierarchical clustering.

In Matlab,
`T = clusterdata(X,'cutoffType',cutoffThreshold)` does all the clustering work and returns the cluster classes in T. However more insight can be obtained by performing each task individually.

1. `Y = pdist(X,'type')` computes distance between data points
2. `Z = linkage(Y,'average')` gets the tree linkages matrix
3. `T = cluster(Z,'cutoffType',cutoffThreshold)` does the actual clustering

The major advantage of breaking the steps is the ability to view awesome graphs, and observe the goodness of your clustering. `dendrogram(Z,'colorThreshold','default')` shows the tree structure with each cluster getting a unique colour. One should also try out the `silhouette()` and `cophenet()` functions.

The other major group is the standard Partition clustering. The classical algorithm is the K-Means where K clusters are formed. In this the user fixes the number of clusters at the start. Random initial cluster centers are picked (usually from the data points). The iterative process then involves assigning “nearest” points to the cluster center, followed by an update of the cluster center itself. This is repeated until convergence, usually such that the cluster assignment does not change (if data size is tractable).

In Matlab, the command `[T, C] = kmeans(X,k)` clusters the data points X into k groups, returning the centers C and the target assignment T. More info

The difference between the two types is obvious. Partitional requires defining the number of clusters while the Hierarchical method needs the user to specify some sort of cutoff threshold to stop the agglomeration process. One needs to decide what is the more suitable criteria for a given problem.

PS: You can try out the functions by creating some data on your own by using mean-shifted Gaussians.
`X = [-2+randn(100,1); randn(100,1); 2+randn(100,1)];`