# On Norms…

The mathematical norm of a vector (or a matrix) is associated with interesting properties for that vector (matrix) and trying to minimize it with other loss function terms or constraints yields interesting properties.

Vector Norms
The classical vector norms are $\ell_0, \ell_1, \ell_2, \ell_\infty$ and all of them have some very interesting properties.

L2 norm: Probably the most common among them is the $\ell_2$ norm which is used in regularization or normalization. Minimization of the $\ell_2$ norm of a vector, usually a weight vector in classifiers such as SVMs, linear / logistic regression, etc. prevents imbalance and makes sure that no single value is very high compared to the others. The minimization problems are differentiable, convex and basically nice!

L-Inf norm: The $\ell_\infty$ norm of a vector is the maximum absolute value of the vector and thus its minimization imposes limits on the range of numbers a vector can take.

L0 norm: The $\ell_0$ norm counts the number of non-zero elements in a vector. This norm is almost always used in the mathematical foundations of all sparsity related problems. However, minimizing the $\ell_0$ norm can get a bit hard to solve (in fact, NP-hard) and thus is very often relaxed to the $\ell_1$ norm.

L1 norm: The $\ell_1$ norm sums up the absolute values of the vector. While minimizing it should reduce the overall absolute sum of the vector, it is often used as an approximation for inducing sparsity as values which are very (depends on the problem) close to zero, can often be ignored. There are algorithms to solve these problems such as Matching Pursuit.

Matrix Norms
While most vector norms can be directly extended to matrices, there are special matrix norms which present interesting properties.

Entry-wise norms: These methods basically treat the matrix as a set of vectors stacked for convenience and simply apply the standard set of vector norms on matrices. In general, they have no special properties.

L2 norm: Also referred to as the Frobenius norm $\|A\|_F$ it can be computed as an entry-wise norm. However, it has more interpretations, as it can also be obtained via the square root of the trace of the matrix conjugate product $\sqrt{trace(A^*A)}$ (always nice to have trace, since it allows matrix commutativity); or via the singular values by computing the $\ell_2$ norm on a “vector” made up of singular values. The minimization has a similar significance and results in convex problems.

Nuclear norm: Denoted as $\|A\|_*$ the nuclear norm computes the equivalent $\ell_1$ norm on the “vector” of singular values. Its minimization induces sparsity (or zero-valuedness) amongst the singular values, which in turn is interpreted as minimizing the matrix rank. A non-full rank matrix is pretty interesting and can be used for dimensionality reduction via projections, for clustering by grouping data points into a few rows, etc.

Chaining norms: Operating on row or column vectors of a matrix, this norm is denoted as $\|A\|_{p,q}$. In this we first apply the $\ell_q$ norm on the columns of the matrix, followed by a $\ell_p$ norm on the resulting vector. Its minimization can be used to obtain interesting combinations of properties such as row sparsity with column regularization when $p = 1, q = 2$.