Category Archives: Mathematics

On Norms…

The mathematical norm of a vector (or a matrix) captures useful properties of that vector (matrix), and minimizing it alongside other loss function terms or constraints endows the solution with those properties.

Vector Norms
The classical vector norms are \ell_0, \ell_1, \ell_2 and \ell_\infty, and each of them has some very useful properties.

L2 norm: Probably the most common among them is the \ell_2 norm, which is used in regularization and normalization. Minimizing the \ell_2 norm of a vector, usually a weight vector in classifiers such as SVMs or linear / logistic regression, prevents imbalance and makes sure that no single value is very large compared to the others. The resulting minimization problems are differentiable, convex and basically nice!

L-Inf norm: The \ell_\infty norm of a vector is the maximum absolute value of the vector and thus its minimization imposes limits on the range of numbers a vector can take.

L0 norm: The \ell_0 norm counts the number of non-zero elements in a vector. This norm appears in the mathematical foundations of almost all sparsity-related problems. However, minimizing the \ell_0 norm is hard (in fact, NP-hard) and thus it is very often relaxed to the \ell_1 norm.

L1 norm: The \ell_1 norm sums up the absolute values of the vector. While minimizing it only reduces the overall absolute sum of the vector, it is often used as a surrogate for inducing sparsity, since values that end up very close to zero (how close depends on the problem) can often be ignored. There are algorithms for the underlying sparse recovery problems, such as Basis Pursuit and greedy methods like Matching Pursuit.
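
As a quick sanity check, here is a minimal Matlab sketch (on an assumed example vector v) computing the four vector norms above:

v = [0, -3, 0, 4, 1];          % example vector (assumed)
l0 = nnz(v);                   % \ell_0: number of non-zero entries (= 3)
l1 = norm(v, 1);               % \ell_1: sum of absolute values (= 8)
l2 = norm(v, 2);               % \ell_2: Euclidean length (approximately 5.10)
linf = norm(v, Inf);           % \ell_\infty: maximum absolute value (= 4)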

Matrix Norms
While most vector norms can be directly extended to matrices, there are special matrix norms which present interesting properties.

Entry-wise norms: These basically treat the matrix as a set of vectors stacked together for convenience, and simply apply the standard vector norms to all the entries. In general, they have no special properties.

L2 norm: Also referred to as the Frobenius norm \|A\|_F, it can be computed as an entry-wise norm. However, it has more interpretations: it can also be obtained as the square root of the trace of the matrix conjugate product, \sqrt{trace(A^*A)} (always nice to have a trace, since it is invariant under cyclic permutation of the factors); or via the singular values, by computing the \ell_2 norm of the “vector” of singular values. Its minimization has a similar significance to the vector case and results in convex problems.

Nuclear norm: Denoted as \|A\|_*, the nuclear norm computes the equivalent \ell_1 norm on the “vector” of singular values. Its minimization induces sparsity (zero values) among the singular values, which in turn is interpreted as minimizing the matrix rank (it is the standard convex surrogate for rank). A low-rank matrix is pretty interesting and can be used for dimensionality reduction via projections, for clustering by grouping data points into a few rows, etc.

Chaining norms: Operating on the rows or columns of a matrix, this norm is denoted as \|A\|_{p,q}. Here we first apply the \ell_q norm to each row of the matrix, followed by an \ell_p norm on the resulting vector of row norms. Its minimization can be used to obtain interesting combinations of properties, such as row sparsity with \ell_2 regularization within each row when p = 1, q = 2.
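
For concreteness, here is a minimal Matlab sketch of the matrix norms above, on an assumed example matrix A:

A = [1, 0, 2; 0, 0, 0; 3, 1, 0];            % example matrix (assumed)
fro = norm(A, 'fro');                       % Frobenius norm, same as sqrt(trace(A'*A))
nuc = sum(svd(A));                          % nuclear norm: \ell_1 norm of the singular values
chain12 = sum(sqrt(sum(A.^2, 2)));          % chaining norm with p = 1, q = 2: \ell_2 of each row, then \ell_1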

Please add more interesting norms / properties that I may have missed in the comments section!

Bézier curves off the Diagonal

We all know that we can drag the center point of a diagonal in a plot and then move it around to make beautiful arcs. However, how are such curves generated mathematically, and can they be easily controlled by a single parameter?

Bézier curves for \alpha parameter values 0, 0.25, 0.5, 0.75 and 1

The answer to both of the above questions is yes! The curves above are essentially quadratic Bézier curves with three control points. The start and end points are relatively easy to pick: in the examples above we pick (1, 1) and (100, 100). However, for any general row count N_R and column count N_C, the end point can be (N_C, N_R), with points denoted in the (x, y) convention, i.e., columns first.

Now that we have two points, we can talk about the third control point. When this is the mid-point of the space, (0.5 \cdot N_C, 0.5 \cdot N_R), the curve is a straight diagonal line, as all three points are collinear. However, adding an imbalance gives deviations from the diagonal. For simplicity, these deviations can be parametrized by a single variable \alpha, which makes the third point (\alpha N_C, (1-\alpha) N_R).

The curves correspond to rational Bézier curves with three control points, which have the additional property of also being able to represent conic sections

\mathbf{B}(t) = \frac{\sum_{i=0}^2 {2 \choose i} t^i (1-t)^{2-i} w_i \mathbf{P}_{i} } { \sum_{i=0}^2 {2 \choose i} t^i (1-t)^{2-i} w_i }

where the w_i are the weights (taking them all equal makes the denominator 1 and recovers the ordinary quadratic Bézier used above), and the value t sweeps from 0 to 1 in small steps to get the (x, y) coordinates at each step.
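
A minimal Matlab sketch of the equal-weight quadratic case, with assumed values for N_R, N_C and \alpha and the control points described above:

N_R = 100; N_C = 100; alpha = 0.65;                % assumed example values
P0 = [1, 1]; P1 = [alpha*N_C, (1-alpha)*N_R]; P2 = [N_C, N_R];   % control points (x, y)
t = linspace(0, 1, 200)';                          % parameter sweep
Bx = (1-t).^2 * P0(1) + 2*t.*(1-t) * P1(1) + t.^2 * P2(1);
By = (1-t).^2 * P0(2) + 2*t.*(1-t) * P1(2) + t.^2 * P2(2);
plot(Bx, By);                                      % curve bending off the diagonal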

An application of this curve (my use-case) is to model the number of sentences (space) used to describe a movie / TV episode in a plot synopsis (plotted on the y-axis) corresponding to the shots (plotted on the x-axis). Empirical evidence suggests that authors spend more space describing the latter half of a video as it approaches the climax, thus allowing us to use \alpha \sim 0.65 as a model which fits better than the diagonal.

Log-Sum-Exp trick

It is quite common to process very small numbers in probability (no really, so small that double precision, which goes down to about 10^{-308}, may not be enough). They are encountered while modelling with Gaussians, which are ubiquitous today. However, there is only so much that one Gaussian can do, and hence the obvious extension is to use a mixture of Gaussians, weighted differently and with different means and covariances.

The probability density function for a mixture of multivariate Gaussians looks like
p(x) = \sum_i w_i \frac{1}{\sqrt{(2\pi)^N |C_i|}} \exp[-\frac{1}{2} (x-\mu_i)^T C_i^{-1} (x-\mu_i)]
where both x and \mu_i are vectors of dimension N, C_i is the covariance of the i-th component, and i indexes the Gaussians in the mixture.

Everyone knows the above equation, but the problem comes when actually trying to compute it. So papers typically report the log-likelihood, which is basically \log(p(x)). Consider an example with 2 Gaussians. Then you may need to evaluate something like f(x) = \log(e^{-1000} + e^{-1001}). Those numbers are not made up; it is quite easy to obtain exponents of the order of -1000. Direct evaluation will usually return -Inf, but we clearly know that this cannot be true, and the answer has to be around -999.something.

The simple thing to do is rewrite f(x) as
\log(e^{-1000}(e^0 + e^{-1})) = -1000 + \log(1 + e^{-1}) \approx -999.7.

And that is the computational trick: pull the maximum out of the sum, and evaluate the rest (which no longer underflows). Just FYI, exp(709) is the largest power of e that double precision can hold; exp(710) overflows to numerical Infinity.
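
In Matlab, a minimal sketch of the trick for a vector of log-domain terms (e.g. the per-component exponents of a mixture) looks like this:

logp = [-1000, -1001];               % log-domain terms; exp() of these underflows to 0
m = max(logp);                       % pull out the maximum
lse = m + log(sum(exp(logp - m)));   % stable log(sum(exp(logp))), approximately -999.69
% naive log(sum(exp(logp))) returns -Inf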

Manifold Learning

Machine learning faces a strange curse of dimensionality, where it becomes difficult to compare data points when they are very long vectors. This arises from the fact that most distances are computed in Euclidean space, and Euclidean geometry behaves quite counter-intuitively in high dimensions (see the classic hypersphere / hypercube explanation).

There are many dimensionality reduction techniques, the most famous being Principal Component Analysis (PCA), which in fact works quite well when the data points are roughly linearly distributed. However, it produces disastrous results in the case of non-linear distributions.

What is a manifold? The underlying idea is that in very high-dimensional spaces, data points from physical processes usually tend to occur in restricted regions, and are almost never uniformly distributed through the entire space. A manifold is an imaginary surface on which these data points are assumed to lie. Furthermore, the key concept is that this manifold has very few degrees of freedom, and hence can be represented in a low-dimensional space.

The distance between the data points is now computed as the shortest distance while tracing along the manifold. For example, consider our Earth. To go from Delhi to New York you could drill through the ground and obtain the shortest distance in 3-D Euclidean space. However, this is infeasible, and the true shortest distance is actually the geodesic distance, the one obtained when traveling on the surface of the Earth. Here, the Earth's surface is our manifold in the large universe!

Given the data points, the objective is to find this manifold so that data points close to each other on the manifold surface can be classified as belonging together, instead of the ones which are close in Euclidean space. However, finding this exact geometric shape is truly a Herculean task in most cases. Hence, the objective is transformed into finding a function which maps the data from the high-dimensional space to a simpler low-dimensional space, such that the distances between points in the low dimension correspond to the geodesic distances computed on the manifold.

Techniques such as ISOMAP and LLE, developed in 2000, created quite a stir and were published in Science!! Here is a pretty interesting tutorial on manifolds, from which the gist above is drawn.
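
As a rough Matlab sketch of the ISOMAP idea (not the published implementation): approximate geodesics by shortest paths over a k-nearest-neighbour graph, then embed them with classical MDS. Here X (an n-by-D data matrix), k and d are assumed placeholders, and the neighbourhood graph is assumed to be connected:

n = size(X, 1); k = 10; d = 2;                       % assumed parameters
sq = sum(X.^2, 2);
E = sqrt(max(bsxfun(@plus, sq, sq') - 2*(X*X'), 0)); % pairwise Euclidean distances
G = inf(n);                                          % neighbourhood graph
for i = 1:n
    [~, idx] = sort(E(i, :));
    G(i, idx(1:k+1)) = E(i, idx(1:k+1));             % keep the k nearest neighbours (plus self)
end
G = min(G, G');                                      % symmetrise
for m = 1:n                                          % Floyd-Warshall shortest paths
    G = min(G, bsxfun(@plus, G(:, m), G(m, :)));     % approximate geodesic distances
end
J = eye(n) - ones(n)/n;                              % classical MDS on the geodesics
B = -0.5 * J * (G.^2) * J;
[V, D] = eig((B + B')/2);
[lam, order] = sort(diag(D), 'descend');
Y = V(:, order(1:d)) * diag(sqrt(lam(1:d)));         % d-dimensional embedding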

Cholesky decomposition for Matrix Inversion

Matrix inversion is a classical problem, and can be very expensive for large matrices. There are many ways to simplify it for special types of matrices. Among them, one is to transform the matrix into upper or lower triangular factors. Consider a target matrix \mathbf{A} which is Hermitian (symmetric, in the real case) and positive-definite. Such matrices are quite common, a prime example being the covariance matrix in statistics, whose inverse appears in the Gaussian probability density function for vectors. The Cholesky decomposition breaks it as

\mathbf{A} = \mathbf{L}\mathbf{L}^T = \mathbf{U}^T\mathbf{U}

where \mathbf{L} is a lower triangular matrix, while \mathbf{U} is an upper triangular matrix.

It is much easier to compute the inverse of a triangular matrix, and standard numerical routines (forward / back substitution) exist for it. The inverse of the original matrix is then computed simply by multiplying the two factor inverses as

\mathbf{A}^{-1} = (\mathbf{L}^{-1})^T(\mathbf{L}^{-1}) = (\mathbf{U}^{-1})(\mathbf{U}^{-1})^T

As a bonus, the determinant is also much easier to compute:

det(\mathbf{A}) = det(\mathbf{L})^2

where det(\mathbf{L}) is just the product of the diagonal entries of \mathbf{L}.

One can also use complex matrices, simply using the conjugate transpose instead of the transpose alone.
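
A minimal Matlab sketch, using an assumed small symmetric positive-definite matrix as an example:

A = [4, 2, 1; 2, 5, 3; 1, 3, 6];     % symmetric positive-definite (e.g. a covariance)
L = chol(A, 'lower');                % A = L*L', with L lower triangular
invL = L \ eye(size(A, 1));          % invert the triangular factor by forward substitution
invA = invL' * invL;                 % A^{-1} = (L^{-1})' * (L^{-1})
detA = prod(diag(L))^2;              % det(A) = det(L)^2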

On Distances…

Distances are used every day in statistics and applied mathematics for comparing things and seeing how far apart they are, and they also tend to be the classifying feature for several pattern recognition problems. Here's a short overview of the distances I know of or have heard about. Please help add others!

Euclidean distance: Straight line distance between any 2 points in n-dimensional Euclidean space (normally most spaces are Euclidean). Most commonly used. More info
Absolute distance (Manhattan distance): Follows the layout of a city while measuring distance. Basically the measure of distance while taking paths that are parallel to the axes. More info
p-norm distance (Minkowski distance): Euclidean-space distance, generalized to the order p (p=1: Absolute, p=2: Euclidean). Derived from the p-norm concepts. More info

Chebyshev distance: The p-norm distance for p=infinity. It picks the axis with the maximum difference between the corresponding coordinates of the two vectors. More info
Mahalanobis distance: Similar in principle to the Euclidean distance, however takes into account the correlation between the data. Widely used in data clustering methods. More info
Algebraic distance: A popular distance metric for Minimum Mean Square Error solutions. It is represented easily with vectors and matrices, which makes it a very powerful tool. More info

Bhattacharyya distance: A symmetric measure of distance used to compare probability distributions, e.g. it can be used for comparing image histograms. More info
Kullback-Leibler distance: Used typically by the information theory people, it is an asymmetric measure between probability distributions. It can simplify a lot of Info. Theory proofs 🙂 More info
Earth Mover’s distance: Thought of as the amount of “Earth” one needs to move to make two probability distributions similar. Used typically with image histograms. More info

Hausdorff distance: An idea from set theory; however, it can be extended to matching shapes using edge images. It is also a measure of how similar two 3D models are in computer graphics. More info
Hamming distance: The number of differing symbols when comparing two strings (typically bits). Can be used to measure bit-flips or errors in data. More info

Metric Learning: See this post.
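
As a quick reference, here is a minimal Matlab sketch computing a few of the distances above between two assumed example vectors x and y (with an assumed covariance C for the Mahalanobis case):

x = [1; 2; 3];  y = [2; 0; 3];  C = [2, 0.5, 0; 0.5, 1, 0; 0, 0, 3];
d_euclidean = norm(x - y);                      % Euclidean (straight line)
d_manhattan = sum(abs(x - y));                  % absolute / Manhattan
d_minkowski = sum(abs(x - y).^3)^(1/3);         % p-norm distance with p = 3
d_chebyshev = max(abs(x - y));                  % p = infinity
d_mahalanobis = sqrt((x - y)' * (C \ (x - y))); % Mahalanobis, accounting for correlation
d_hamming = sum(x ~= y);                        % Hamming (number of differing entries)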

Symbolic Math Toolbox – Matlab

Your scientific calculator now does definite integration, differentiation, etc. Maybe higher-end and more complex graphing calculators can do indefinite stuff too! However, I just found this very interesting ability in Matlab to work with a math symbol, typically x, something I had so far known to exist only in Mathematica. This is the Symbolic Math Toolbox, and its uses are numerous.

Although I am sure it still requires a lot of development, especially compatibility with the other data types that Matlab supports, for starters it seems like a really nice feature. The example where I used it goes like this. Consider a simple 3×3 symmetric matrix whose non-diagonal corners are equal, and unknown. You want to obtain a value for this unknown such that the determinant of the matrix goes to 0. Manually, yes, it's easy: it's a simple quadratic equation, and it can be solved for. However, now consider a 20×20 matrix 🙂 It's terrible to compute its determinant, and then there is still an equation to solve. What the Symbolic Math Toolbox allows you to do is define a variable – say z. In my case it would then be

syms z
% schematic correlation matrix with the unknown z in its off-diagonal corners
% (the elided entries depend on the actual 20x20 case)
Rxx = [1, 0.9, 0.8, ... z; ...; ...; z, ..., 0.8, 0.9, 1];
sols = solve(det(Rxx) == 0, z)

and you have the possible solutions for the variable z. Note the usage of the function solve(), which is typically used to solve for a variable given a polynomial equation.

It also does all the standard integration, differentiation, and even Taylor series expansion 🙂 and has a lot of applications. I now want to check it out for the basic transforms – Fourier, Laplace, Z, etc. It is best to refer to the Matlab documentation on the toolbox for more information and examples. I am really loving this combination of the power of “indefinite” math with powerful numeric functions.
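
For instance, a small sketch of these standard symbolic operations (assuming the Symbolic Math Toolbox is installed; the expressions are arbitrary examples):

syms x
d = diff(sin(x) * exp(x), x);          % differentiation
I = int(x^2 * exp(-x), x, 0, Inf);     % definite integration (evaluates to 2)
T = taylor(exp(x), x, 'Order', 5);     % Taylor series expansion around x = 0
% fourier(), laplace() and ztrans() cover the transforms mentioned above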