On Norms…

The mathematical norm of a vector (or matrix) captures useful properties of that object, and minimizing it, either on its own or alongside other loss terms and constraints, pushes solutions toward those properties.

Vector Norms
The classical vector norms are \ell_0, \ell_1, \ell_2, and \ell_\infty, and each has some very interesting properties.

L2 norm: Probably the most common among them is the \ell_2 norm, which shows up in both regularization and normalization. Minimizing the \ell_2 norm of a vector, usually a weight vector in classifiers such as SVMs and linear / logistic regression, discourages imbalance: no single weight can grow very large relative to the others. The resulting minimization problems are differentiable, convex, and basically nice!
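To see the "imbalance" point concretely, here is a small NumPy sketch (the example vectors are mine): two weight vectors with the same total absolute mass, where the one dominated by a single entry pays a much larger \ell_2 penalty.

```python
import numpy as np

# Two weight vectors with the same l1 mass: one balanced, one dominated
# by a single large entry (the kind of imbalance l2 regularization fights).
w_balanced = np.array([1.0, 1.0, 1.0, 1.0])
w_spiky = np.array([4.0, 0.0, 0.0, 0.0])

# The l2 norm is the square root of the sum of squares; the spiky vector
# is penalized twice as hard even though the absolute mass is identical.
print(np.linalg.norm(w_balanced, 2))  # 2.0
print(np.linalg.norm(w_spiky, 2))     # 4.0
```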

L-Inf norm: The \ell_\infty norm of a vector is its maximum absolute value, so minimizing it limits the range of values the vector's entries can take.
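In NumPy this is just `ord=np.inf` (example vector mine):

```python
import numpy as np

v = np.array([1.0, -7.0, 3.0])

# The l-infinity norm is simply the largest absolute entry.
linf = np.linalg.norm(v, np.inf)
print(linf)  # 7.0
```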

L0 norm: The \ell_0 norm counts the number of non-zero elements in a vector (strictly speaking it is not a true norm, since it is not homogeneous, but the name has stuck). It appears in the mathematical foundations of almost all sparsity-related problems. However, minimizing the \ell_0 norm is hard to solve (NP-hard, in fact) and thus it is very often relaxed to the \ell_1 norm.
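A one-liner sketch of the count (example vector mine):

```python
import numpy as np

v = np.array([0.0, 2.0, 0.0, -1.0, 0.0])

# The "l0 norm" is just the count of non-zero entries.
l0 = np.count_nonzero(v)
print(l0)  # 2

# NumPy also accepts ord=0 for vectors, with the same meaning.
assert np.linalg.norm(v, 0) == l0
```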

L1 norm: The \ell_1 norm sums the absolute values of the vector. While minimizing it directly reduces the overall absolute mass of the vector, it is most often used as a convex surrogate for inducing sparsity, since entries driven sufficiently close to zero (how close depends on the problem) can often be ignored. There are algorithms for these sparse recovery problems, such as Matching Pursuit.
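One way to see why \ell_1 minimization produces sparsity is through its proximal operator, soft-thresholding, which sets small entries exactly to zero. A minimal sketch (function name and example values are mine):

```python
import numpy as np

def soft_threshold(v, lam):
    """Proximal operator of lam * ||.||_1: shrinks every entry toward
    zero and sets entries with magnitude below lam exactly to zero."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

v = np.array([0.3, -2.0, 0.05, 1.2])
print(soft_threshold(v, 0.5))  # [ 0.  -1.5  0.   0.7]
```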

Matrix Norms
While most vector norms extend directly to matrices, there are special matrix norms with interesting properties of their own.

Entry-wise norms: These norms basically treat the matrix as a set of vectors stacked for convenience and simply apply the standard vector norms to all the entries. In general, they have no special properties.

L2 norm: Also referred to as the Frobenius norm \|A\|_F, it can be computed as an entry-wise norm. However, it has more interpretations: it also equals the square root of the trace of the conjugate product, \sqrt{\mathrm{trace}(A^*A)} (always nice to have a trace around, since it is invariant under cyclic permutations of matrix products); and it equals the \ell_2 norm of the “vector” of singular values. Its minimization has a similar significance to the vector case and results in convex problems.
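The three interpretations can be checked numerically in a few lines of NumPy (example matrix mine):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])

entrywise = np.sqrt(np.sum(A ** 2))              # entry-wise l2
via_trace = np.sqrt(np.trace(A.conj().T @ A))    # sqrt(trace(A* A))
svals = np.linalg.svd(A, compute_uv=False)
via_svals = np.linalg.norm(svals, 2)             # l2 of singular values

# All three routes agree with NumPy's built-in Frobenius norm.
print(entrywise, via_trace, via_svals)
```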

Nuclear norm: Denoted \|A\|_*, the nuclear norm computes the equivalent \ell_1 norm on the “vector” of singular values. Its minimization induces sparsity (or zero-valuedness) among the singular values, which makes it the standard convex surrogate for minimizing the matrix rank. A non-full-rank matrix is pretty interesting: it can be used for dimensionality reduction via projections, for clustering by grouping data points into a few rows, and so on.
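A small sketch of the singular-value view on a deliberately rank-1 matrix (example values mine); NumPy exposes the nuclear norm directly via `ord='nuc'`:

```python
import numpy as np

# A rank-1 matrix: exactly one non-zero singular value.
A = np.outer([1.0, 2.0], [3.0, 1.0])

svals = np.linalg.svd(A, compute_uv=False)
nuclear = svals.sum()  # ||A||_* = l1 norm of the singular values
print(np.round(svals, 6))        # one dominant value, the other ~0
print(np.linalg.matrix_rank(A))  # 1

# Matches NumPy's built-in nuclear norm.
assert np.isclose(np.linalg.norm(A, 'nuc'), nuclear)
```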

Chained norms: Operating on the row or column vectors of a matrix, these norms are denoted \|A\|_{p,q}. Under one common convention, we first apply the \ell_q norm to each row, followed by an \ell_p norm on the resulting vector of row norms (some authors group columns instead, so check the convention in use). Minimizing such a norm yields interesting combinations of properties: with p = 1, q = 2, the \ell_{2,1} norm promotes row sparsity while regularizing within each row.
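A sketch of the \ell_{2,1} case under the row-grouping convention (example matrix mine):

```python
import numpy as np

A = np.array([[3.0, 4.0],
              [0.0, 0.0],
              [1.0, 0.0]])

# ||A||_{2,1}: l2 norm of each row, then the l1 norm (a plain sum,
# since row norms are non-negative) of those row norms. Minimizing
# this zeroes out whole rows while regularizing the surviving ones.
row_norms = np.linalg.norm(A, axis=1)  # [5., 0., 1.]
l21 = row_norms.sum()
print(l21)  # 6.0
```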

Please add more interesting norms / properties that I may have missed in the comments section!

