Category Archives: MPEG-7

MPEG-7 Colour: Dominant Colour Descriptor

This post deals with another prominent image/video descriptor: colour. The Dominant Colour Descriptor of MPEG-7 is quite a useful tool for “query by example” applications. As the name indicates, the most dominant colours in the image are presented, each tuned to be as close as possible to the original.

The main use is image retrieval. This colour, and perhaps the fraction of the image it covers, can be used as a “feature” of the image and compared against the rest of the images to check for similarity. There are also other descriptors such as the Colour Structure Descriptor, the Colour Layout Descriptor, and the basic histograms. Combined with these, an effective feature vector can be formed for the required application.

Firstly, the standard uses the CIE L*a*b* colour space, because Euclidean distances there correspond more closely to perceived colour differences than in conventional RGB (i.e. moving some distance X in RGB does not change the perceived colour equally in all directions). The MPEG-7 standard suggests the so-called top-down technique: in this strategy, we first obtain one dominant colour, as the centroid of all pixels in the image.
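Before clustering, each pixel has to be taken from sRGB into CIE L*a*b*. Below is a minimal pure-Python sketch of that conversion (sRGB gamma removal, the standard sRGB-to-XYZ matrix, D65 white point); the function name and structure are my own, not anything prescribed by MPEG-7.

```python
def srgb_to_lab(r, g, b):
    """Convert 8-bit sRGB to CIE L*a*b* (D65 white point) -- illustrative sketch."""
    # Inverse gamma: sRGB -> linear RGB
    def lin(c):
        c /= 255.0
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4
    rl, gl, bl = lin(r), lin(g), lin(b)
    # Linear RGB -> CIE XYZ using the standard sRGB matrix
    x = 0.4124 * rl + 0.3576 * gl + 0.1805 * bl
    y = 0.2126 * rl + 0.7152 * gl + 0.0722 * bl
    z = 0.0193 * rl + 0.1192 * gl + 0.9505 * bl
    # Normalise by the D65 reference white
    xn, yn, zn = x / 0.95047, y / 1.0, z / 1.08883
    # Piecewise cube-root function used by the Lab definition
    def f(t):
        return t ** (1 / 3) if t > (6 / 29) ** 3 else t / (3 * (6 / 29) ** 2) + 4 / 29
    fx, fy, fz = f(xn), f(yn), f(zn)
    return 116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz)
```

With pixels in this space, a plain Euclidean distance between two (L*, a*, b*) triples is a reasonable stand-in for perceived difference, which is exactly why the clustering below operates there.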

The algorithm then proceeds by splitting this cluster. The cluster with the highest distortion is divided into two by adding a small perturbation vector. However, note that this vector is not just any random vector: it is computed (using the eigenvalues and eigenvectors of the cluster's covariance matrix) to point in the direction of maximum variance. Thus we obtain two new centroid locations. These are further refined by the generalized Lloyd algorithm, i.e. assign each point to the cluster in closest proximity, then re-compute each centroid from its assigned points.

Since we always pick and split the centroid with maximum distortion, we can obtain any number of dominant colours for the image. Below are example images of the original image, and 3 (left top), 4 (right top), 5, 6, 7 (left bottom) and 8 (right bottom) dominant colour images (click for larger view).
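The split-and-refine procedure described above can be sketched as follows. This is my own illustrative implementation, not the normative MPEG-7 extraction; the scaling of the perturbation vector and the iteration counts are arbitrary choices.

```python
import numpy as np

def dominant_colours(pixels, n_colours, lloyd_iters=20):
    """Top-down dominant-colour sketch: start with one cluster, repeatedly
    split the highest-distortion cluster along its direction of maximum
    variance, and refine with the generalized Lloyd algorithm.
    `pixels` is an (N, 3) array, ideally in CIE L*a*b*."""
    pixels = np.asarray(pixels, dtype=float)
    centroids = [pixels.mean(axis=0)]          # first dominant colour: global centroid
    while True:
        # Generalized Lloyd algorithm: alternate assignment and centroid update
        for _ in range(lloyd_iters):
            dists = np.linalg.norm(
                pixels[:, None, :] - np.asarray(centroids)[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            centroids = [pixels[labels == k].mean(axis=0)
                         if np.any(labels == k) else centroids[k]
                         for k in range(len(centroids))]
        if len(centroids) == n_colours:
            return np.asarray(centroids), labels
        # Pick the cluster with the highest distortion (sum of squared distances)
        distortion = [((pixels[labels == k] - centroids[k]) ** 2).sum()
                      for k in range(len(centroids))]
        k = int(np.argmax(distortion))
        members = pixels[labels == k]
        # Perturb along the eigenvector of maximum variance of that cluster
        eigvals, eigvecs = np.linalg.eigh(np.cov(members.T))
        delta = 0.5 * np.sqrt(eigvals[-1]) * eigvecs[:, -1]  # largest eigenvalue last
        centre = centroids.pop(k)
        centroids += [centre + delta, centre - delta]
```

Run on the pixel array of an image with `n_colours` from 3 to 8, this yields the kinds of quantised images shown below.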

Original Image

3 to 8 Dominant Colour Image


Ranking Retrieval

In this world of ever-expanding multimedia, retrieving it and searching it properly through queries is an important task. Image retrieval systems have started appearing, but they are still not up to the mark. By image retrieval I mean searching using characteristic properties of the images themselves, not their associated text.

Query by example is a typical retrieval method, in which we supply a sample image and ask the system to search the database for similar images. The audio analogue is query by humming, where one hums a tune and all similar tunes should show up. As can be imagined, this is quite an extraordinary job: searching through ALL the multimedia content out there (the WWW) is almost impossible!

Anyway, this post is about ranking how good or bad a proposed retrieval algorithm is. What would you typically like? All the images similar to the query should come up; wrong images popping up should be minimized; and the results should be ranked in order of content similarity, with the highest rank being the most similar. The Average Normalized Modified Retrieval Rank (ANMRR) proposes to do exactly that: it converts a seemingly subjective task into an objective, quantitative one.

The initial approach was to average the retrieval rate: a measure of how many of the ground truth images (the labelled, truly similar images) are found within a certain number of retrieved results for each query. However, setting this “certain number” soon becomes a problem. Also, since the number of ground truth images NG differs for each query, this Average Retrieval Rate (ARR) would be biased by the smallest group.

To counter the first problem, a rank is assigned to each kth ground truth image. Then a “relevant ranks” threshold K (typically twice the ground truth size for that query) is set, which indicates the level of tolerance for that query, and a new rank is defined as the actual rank if it falls within K, or a worst-case penalty of 1.25K otherwise. Averaging this over the ground truth images of a query gives the Average Rank (AVR).

Further, to counter the bias introduced by different sizes of ground truth sets, a Modified Retrieval Rank (MRR) is defined as the Average Rank minus half of (ground truth size + 1). MRR = 0 in case all images are perfectly found and ranked, but the upper bound is still dependent on NG(query). This is then normalized, removing the mismatch in the number of ground truth images, to obtain the Normalized Modified Retrieval Rank (NMRR) as
NMRR = \frac{AVR - 0.5[1+NG]}{1.25K - 0.5[1+NG]}

This NMRR is averaged over all queries to obtain the ANMRR. A low value of ANMRR (around, say, 0.1) indicates quite good performance, while anything above 0.5 certainly means the algorithm needs a good review. The ANMRR is seen to coincide with subjective evaluation of the results. It is also the base comparison criterion for the MPEG-7 Colour Descriptors and their experiments.
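The chain of definitions above (capped rank, AVR, MRR, NMRR, ANMRR) is short enough to write out directly. Here is a sketch following the formulas in this post; the function names and the `k_factor` parameter are my own, and I use K = 2·NG as suggested above (the standard's experiments pick K slightly differently).

```python
def nmrr(retrieved, ground_truth, k_factor=2):
    """Normalized Modified Retrieval Rank for one query.
    `retrieved` is the ranked list returned by the system,
    `ground_truth` the truly relevant items for the query."""
    ng = len(ground_truth)
    k = k_factor * ng                       # "relevant ranks" threshold K
    position = {item: i + 1 for i, item in enumerate(retrieved)}  # 1-based ranks
    # Actual rank if found within K, else the worst-case penalty 1.25K
    ranks = [position[g] if position.get(g, k + 1) <= k else 1.25 * k
             for g in ground_truth]
    avr = sum(ranks) / ng                   # Average Rank
    # MRR = AVR - 0.5(1+NG); divide by its maximum to normalize
    return (avr - 0.5 * (1 + ng)) / (1.25 * k - 0.5 * (1 + ng))

def anmrr(queries):
    """Average NMRR over (retrieved, ground_truth) pairs."""
    return sum(nmrr(r, g) for r, g in queries) / len(queries)
```

A perfect retrieval (all ground truth images ranked first) gives NMRR = 0, and a retrieval that finds none of them within K gives NMRR = 1, matching the intuition that lower is better.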

The paper titled “Color and Texture Descriptors” provides a nice, compact explanation of the same.

MPEG-7 Texture: Edge Histogram Descriptor

The main descriptors discussed in MPEG-7 for the case of video / still images are Texture, Colour, Shape and Motion. We shall start this series of posts with the analysis of Texture.

Texture and colour are by far the most important, yet simple, features and are the most intuitive for describing objects or pictures. MPEG-7 has three main descriptors related to texture. Two of them relate to the Homogeneous Texture Descriptor, while the third is the Edge Histogram Descriptor (EHD). This post deals with the latter, which can be quite useful for scene classification in video sequences, especially sports videos.

C. S. Won et al. describe the main implementation of the EHD and one of its major applications, image retrieval. The paper also has nice figures that give a clear idea of the procedure.

The idea is that of local processing. The image is divided into 4×4 sub-images. Each sub-image is further divided into smaller image blocks (typically 4×4 pixels). The standard allows for vertical, horizontal, diagonal (45 and 135 degrees) and non-directional edges. If the image block is monotone (flat), no edge is counted. Simple filtering of each image block yields its most prominent edge. A histogram of 5 bins (one per edge type) is computed over all the image blocks in the sub-image. This procedure is repeated for all 16 sub-images, and hence we obtain 80 histogram coefficients. The standard proposes non-linear quantisation for the sake of storage (3 bits/coefficient).
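The per-block filtering and per-sub-image histogram can be sketched as below. The 2×2 filter masks are the five edge-detector masks commonly given for the EHD; the block size and edge threshold here are illustrative values of my own, not normative ones.

```python
import numpy as np

# 2x2 filter masks for the five edge types (vertical, horizontal,
# 45-degree, 135-degree, non-directional)
FILTERS = {
    'vertical':        np.array([[1, -1], [1, -1]], dtype=float),
    'horizontal':      np.array([[1, 1], [-1, -1]], dtype=float),
    'diagonal_45':     np.array([[np.sqrt(2), 0], [0, -np.sqrt(2)]]),
    'diagonal_135':    np.array([[0, np.sqrt(2)], [-np.sqrt(2), 0]]),
    'non_directional': np.array([[2, -2], [-2, 2]], dtype=float),
}

def edge_histogram(image, block=4, threshold=11):
    """EHD sketch: `image` is a 2-D greyscale array whose sides are
    multiples of 4*block. Each of the 16 sub-images contributes a 5-bin
    histogram, giving 80 coefficients (here left unquantised)."""
    h, w = image.shape
    sh, sw = h // 4, w // 4
    descriptor = []
    for i in range(4):
        for j in range(4):
            sub = image[i * sh:(i + 1) * sh, j * sw:(j + 1) * sw]
            bins = dict.fromkeys(FILTERS, 0)
            for y in range(0, sh - block + 1, block):
                for x in range(0, sw - block + 1, block):
                    blk = sub[y:y + block, x:x + block].astype(float)
                    # Average the block down to 2x2, then apply each mask
                    m = block // 2
                    quad = np.array([[blk[:m, :m].mean(), blk[:m, m:].mean()],
                                     [blk[m:, :m].mean(), blk[m:, m:].mean()]])
                    strengths = {name: abs((f * quad).sum())
                                 for name, f in FILTERS.items()}
                    best = max(strengths, key=strengths.get)
                    if strengths[best] >= threshold:   # monotone blocks count no edge
                        bins[best] += 1
            n_blocks = (sh // block) * (sw // block)
            descriptor += [bins[name] / n_blocks for name in FILTERS]
    return np.array(descriptor)   # 80 normalised histogram coefficients
```

For retrieval, two such 80-coefficient descriptors would simply be compared with an L1 or L2 distance.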

Edges are quite important for our visual system, and these coefficients quantify them in a proper way. It is definitely a basis for Image Retrieval – querying for images by example.

The MPEG Story and MPEG-7 Descriptor Series

MPEG-7, in the series of standards by the Moving Picture Experts Group, is a really cool one. There is a paradigm shift in the way one looks at multimedia, and semantic analysis is given much more importance.

MPEG-1 and 2 focussed mainly on the compression of data, both audio and video. Even today, MPEG-2 keeps evolving, incorporating new techniques for higher compression while retaining better quality. The goal of MPEG-2 is absolutely clear from its title, Coding of Moving Pictures and Associated Audio, and it aims to do exactly that.

MPEG-4, on the other hand, is the start of a different way of looking at the coding of video. It is defined as the Coding of Audio-Visual Objects, and throughout the standard, objects are given a lot of importance. Transcoding, interoperability, error resilience, etc. are the other usual features. MPEG-4 now incorporates so many parts that it is too exhaustive to list them here.

MPEG-4 became quite complex with all its syntax and header information. The organisation was so complicated that when H.264 came about, its official goals actually included a “simple syntax specification” ;-) along with highly improved coding efficiency. MPEG-4 has since adopted the H.264 standard as its Part 10 (ISO/IEC 14496-10), Advanced Video Coding.

Coming to MPEG-7: this is formally named the Multimedia Content Description Interface, and understanding, classifying and retrieving content is now the most important goal. High-level semantic information is the need of the hour, and this standard provides several descriptors that help perform this analysis.

In the following months, I would like to try and review some of the descriptors (features) that are used in the MPEG-7 standard. Of course, the main intention is for me to learn them, as usual :-)

PS: There was no MPEG-3, 5, 6, 8, 9, etc. The latest is MPEG-21, of the 21st century. MPEG-3, intended for HD specifications, was initiated but eventually merged into MPEG-2. Proposals for 5 (serial), 6 (even numbers, since 3 failed) and 8 (powers of 2) were made, and 7 was chosen :-P Ask Leonardo Chiariglione why!