Category Archives: Multimedia Compression

FLAC – Lossless Audio

The Free Lossless Audio Codec (FLAC), is a file format specially meant for audio which gives a lossless output (i.e. Perfect reconstruction can be achieved). Well, after MP3, its debatable whether a perfectly lossless file is actually required, but technology development still happens πŸ™‚ So, since its specially meant for audio, its compression factor is more than the standard lossless file archival formats like zip / rar.

So what exactly makes this FLAC lossless, yet smaller than the other techniques. The encoding procedure can be understood in simple small steps. Firstly, a blocking of the samples is carried out. Each such encoded block is packed into a frame with its headers and footers consisting of CRCs and other stuff. The blocking size is a crucial criteria, as a small size would mean a higher frame data overhead, and a large blocking size would mean a badly fitting model. By default a block size of 4096 samples is chosen for the 44.1kHz sample rate.

In the next step, the encoder tries to approximate this signal whose residual would then require lesser bits to encode. The two main ways of modelling are either by fitting simple polynomials to the signal or by using linear predictive coding (LPC). The model parameters form the information to be stored. Ofcourse LPC requires more bits than polynomials, but generally tends to give more accurate models.

The LPC is a model of our human speech production system. In extremely simple words, it assumes the voice production to be mainly a stream of impulses passing through our throat and sound production system, a response function, and LPC tries to model this very function which can give a close approximation of the audio sample’s spectrum.

The model is now generated, and residual coding is carried out on the left over (signal – model) residual. This is generally broken into several partitions, and is found to have a Laplacian distribution, on which a special type of Huffman code, called the Rice code is applied for higher efficiency. Each such partition has a Rice parameter which is adjusted for that small distribution.

Thus the frame is then packed, called the framing operation with a header containing crucial information like sample rate, bits per sample, etc. The data then follows and is closed by a CRC of the entire encoded frame for error detection.

The main point to note here is unlike most other compression procedures there is no quantisation operation. This is what makes this system entirely lossless, as all the above described procedures are perfectly reversible.


MP3 – Compression Strategy

Continuing from the previous post giving a little information about MP3 in general, we will now look at the actual compression methodology.

The time-domain samples, PCM input is fed to a Polyphase Filterbank which filters the incoming 1152 samples into 32 equally spaced frequency sub-bands. But this leads to a 32-fold increase in the number of samples, and hence decimation by 32, is carried out in each sub-band. Simultaneously a 1024-point and 256-point FFT is carried out to give good resolution and information on spectral change.

This FFT data is fed to the Psychoacoustic Model block. Algorithms in this block model the human sound perception. This provides information on the window to be used for the signals in 32 sub-bands. Normal, Start, Stop and Short are the Windows defined in the MPEG standard. Detection of dominant tonal components is carried out and critical band masking thresholds are calculated. These are the scalefactors bands for which the quantization noise is to be kept within limits.

Now, Windowing is carried out on each of the 32 sub-bands. A Modified DCT is applied to each time frame of the sub-band samples which are thus split into 18 finer sub-bands creating a granule with a total of 576 lines.

The Scaling and Non-uniform Quantization is now applied to these 576 spectral values at a time. This is done in two nested loops distortion control loop (outer) and rate control loop (inner).

Its enough to understand here that the rate control loop does the quantization of frequency domain samples and also determines the required quantization step-size. The quantization step is increased so as to keep the number of Huffman coded bits lower than the total available bits (fed as CBR requirement initially). On the other hand, the distortion control loop manages the quantization noise, and keeps it below the masking threshold computed for the scalefactors.

These quantized values, finally satisfying the required criteria are now Huffman coded. The Huffman code related side information is given by the loops and is stored in the side information of each frame. Finally the bit-stream formatting and CRC generation is carried out, and frames representing 1152 samples are put out.

Information about the decompression technique and more details can be found at this well-written paper: β€œThe Theory Behind MP3” by Rassol Raissi.

MP3 – All you wanted to know!

Today, we’ll have a look at the most famous of multimedia content – MP3 files. Firstly, MP3 is not MPEG-3 but is MPEG-1 Layer 3. Three such compression techniques were initiated by the MPEG (Moving Pictures Experts Group) for audio only, and were called the Layer 1, 2, and 3, increasing in complexity and performance. The Layer 3 achieves a whopping compression from 1.4Mbps to 128kbps (approx. 12:1) without any audible degradation and thus its popularity. This standard is given the number ISO/IEC – 11172-3 and was accepted in 1993.

The MP3 is a perceptual codec, by which we mean that it exploits the fact that our human ear is not sensitive to a lot of data (sounds) and also is affected by temporal and simultaneous or tonal masking. The human ear has 24 frequency bands. Now, when one tone in any band is above the masking threshold, its not possible for our ear to hear the other frequency components in that band. This is Tonal masking. On the other hand, Temporal masking occurs when two tones are played at the same time, and hence the stronger (in volume) tone dominates over the weaker one. This also causes a small duration of pre-masking (~50ms) and post-masking (~100ms) where we cannot hear softer sounds.

One important aspect that needs to be clear is that the MP3 generally functions at 128kbps but can function at other bit-rates too. This is an important specification in the encoding procedure. Also possible is the Variable Bit Rate (VBR) which assigns a specific bit rate to every frame, but this has drawbacks of possibility of incorrect timing display in the decoder / player and problems with broadcasting.

All MP3 files are divided into 1152 sample frames that last for 26ms. This gives a frame rate of 38 frames / second. Each frame consists a 4 byte header, could have a 16 byte CRC, side information which consists information to decode the main data. This is followed by the main data which has thresholds for the scalefactor bands which reduce the quantization noise, the Huffman coded bits, and an optional ancillary data.

Finally, this is appended by a 128-byte tag known as ID3 to enter textual information like β€˜TAG’ (3), title (30), artist (30), album (30), year (4), comment (28), an all-zero byte (1), track number (1) and genre (1). The next step in this is the ID3v2 which is a very recent standard, and not yet commonly found. This is placed at the beginning of a file and has a dynamic size which thus does not impose restrictions on information size.

We shall continue with the actual compression strategy in the next post as this one has grown too long already! πŸ˜‰

Waveform Audio Format – .wav Files

This post shall deal with the basics of Audio File storage. We have covered most of the image formats, and I’ll try to put up something on Scalable Vector Graphics (seem interesting :)) after learning it. Anyways, so this wave format is a standard put in by Microsoft and IBM and is derived from the RIFF (Resource Interchange File Format). The difference from the tagged chunks IFF format is the usage of the little-endian for multi-byte reading. The IFF was introduced to exchange data between various companies (Apple, Microsoft and IBM) effectively in the early days. The well known AVI (audio-visual interleaved data) too is a part of this RIFF standard.

This format is extremely simple (like the BMP case in images). I would say there is no encoding on the data which is represented by PCM (Pulse Coded Modulation). This uncompressed case was good in the early days where the Internet transfer of files wasn’t so regular. With the coming and popularisation of the Internet it was felt to reduce these wave file sizes and thus came the now famous MP3.

A crude measurement of the file size can be done in this way:
Time of audio file (in seconds) x No. of Channels x Samples per second (Sampling Rate) x Bits per sample.

Generally used values for each are 2 channel, 44.1kHz (CD Quality), 48kHz (DVD Quality), 8kHz (decent for only speech) and 8-bit (on the lower side), 16-bit (good), 32-bit (overkill!) per sample PCM code.

Now coming to the part of the bit-stream, this is done by making chunks as mentioned earlier. There is a header which gives all the required details of the file, and is followed by the data chunks. Important in the header are the no. of channels, audio format, the sample rate, and bits per sample.

Its interesting to note that 8-bit samples are stored as unsigned integers (0 to 255) while 16-bit samples are stored as 2’s complement integers from -32768 to 32767. For more information on the byte-structure and bit-stream I would suggest to have a look at this to allow me from copying the same redundant information. πŸ™‚

Why 44.1kHz and not 40kHz for Audio

So, this was one of the questions asked during my Nvidia interview (when I said I was interested in Multimedia). Why use a 44.1kHz sampling rate rather than 40kHz when most humans can’t hear above the 18.5kHz range or definitely the 20kHz mark. Well I didn’t know the answer then, but here it is now. Its mostly quoted from John Watkinson’s book, The Art of Digital Audio.

So basically, “In the early days of digital audio research, the necessary bandwidth of about 1 Mbps per audio channel was difficult to store. Disk drives had the bandwidth but not the capacity for long recording time, so attention turned to video recorders. These were adapted to store audio samples by creating a pseudo-video waveform which would convey binary as black and white levels. The sampling rate of such a system is constrained to relate simply to the field rate and field structure of the television standard (NTSC / PAL – note the refresh rates used ahead), so that an integer number of samples can be stored on each usable TV line in the field. Such a recording can be made on a monochrome recorder, and these recording are made in two standards, 525 lines at 60 Hz and 625 lines at 50 Hz. Thus it is possible to find a frequency which is a common multiple of the two and is also suitable for use as a sampling rate.

The allowable sampling rates in a pseudo-video system can be deduced by multiplying the field rate by the number of active lines in a field (blanking lines cannot be used) and again by the number of samples in a line. By careful choice of parameters it is possible to use either 525/60 or 625/50 video with a sampling rate of 44.1KHz.”

Ah ha.. so thats the point. Digital audio stored on video recorders were to be sampled at 44.1kHz to best suite the video setup and not the audio. But we know that tradition continues.. and even today in the days of MP3 (ripped again from CD / DVD / 44.1kHz content) we still do not bother to resample the data (a thing which could harm quality). A few calculations were further explained as follows.

“In 60 Hz video, there are 35 blanked lines, leaving 490 lines per frame or 245 lines per field, so the sampling rate is given by 60 X 245 X 3 = 44.1 KHz. While, in 50 Hz video, there are 37 lines of blanking, leaving 588 active lines per frame, or 294 per field, so the same sampling rate is given by 50 X 294 X3 = 44.1 Khz.

The sampling rate of 44.1 KHz came to be that of the Compact Disc. Even though CD has no video circuitry, the equipment used to make CD masters is video based and determines the sampling rate.”

Thats the secret behind the fishy 44.1kHz sampling rate πŸ™‚

The Wavelet Transform Development and Images

In this post, we shall basically look at how the DWT works for images and what are its implications. Also a small look at why wavelets started shall give more insight into the compression technology used in the JPEG2000.

We know of Fourier analysis which transforms our view from the time-domain to the frequency-domain. But a serious drawback was the loss of time information once in frequency-domain which caused problems to analysis of most real-life non-stationary signals. This was corrected by introducing the Short Time Fourier Transform (STFT) which gave frequency v/s time information, as the signal was windowed to form the spectrogram. Again, the problem in this approach was the determination of window size which would be fixed over the entire signal. We would thus lose critical information with extreme variation in frequency.

Thus arose the need for Wavelet analysis which could employ a variable size window. A long window for low-frequency-resolution and a shorter window for high-frequency-resolution. These wavelets were defined as waveforms of limited duration with an integral value of zero.

The wavelet analysis involves breaking up of a signal into shifted and scaled versions of the above mentioned wavelet. The data thus obtained is a time-scale representation with the magnitude representing the correlation between the wavelet and section of the signal. Thus a low-scale i.e. compressed wavelet captures rapidly changing details which are of a higher frequency, and a high-scale i.e. stretched wavelet captures slowly changing coarse details of a lower frequency.

Now, calculating this “local” Continuous Wavelet Transform at every point and scale generates too much data, hence we choose a subset of these scales and positions. This is the Discrete Wavelet Transform (DWT). It is found that if the scales and positions are based on the powers of two, dyadic in nature; the analysis is much more efficient and equally accurate. The practical approach to this is a simple filtering algorithm. A low-pass filter generates the high-scale (long) approximate coefficients and a high-pass filter which generates the low-scale (short) detail coefficients. The resulting coefficients are sub-sampled by 2 as to keep the total amount of data the same.

In case of 2-dimensional data (images), say matrix A, the rows are first wavelet transformed (columns can be transformed first too.. doesn’t matter) to give GrA (approx.) and HrA (detail). These are then transformed column-wise to give GcGrA (left top), GcHrA (right top – horizontal details), HcGrA (left bottom – vertical details) and HcHrA (right bottom – diagonal details). The resulting coefficients are shown in tree form for the image below.

Wavelet Decomposition of Image

Wavelet Decomposition of Image

Well thats it, so to sum it up, the coefficients are filtered parts of the signal such that the filters are designed to capture the entire frequency range (eg. digital frequency: 0 – 0.5 and 0.5 – 1).

JPEG2000 – Brief Methodology

The JPEG released in 1994, was succeeded by another amazing compression procedure by the JPEG (Joint Photographic Experts Group) in 2000. Hence the name JPEG2000. This is filed under the ISO/IEC 15444-1 and has an extension in the ISO/IEC 15444-2.

In this post we shall just briefly understand how the technique works. More discussions on Wavelet Transform and Arithmetic Coding and their exact usage in JPEG2000 will come in subsequent posts.

The most significant difference is the Wavelet Transform which is way better than the Cosine Transform. The advantages of which are seen in the much higher compression ratio (can easily offer 3-4 times better compression than JPEG, without perceptual loss in quality). But the required resource consumption is higher than that for JPEG and this is the major drawback and reason for us not seeing JPEG2000 compressed images frequently. Also, at high compression ratios, JPEG breaks down due to the blocking artifacts, while JPEG2000 may just get a little blur.

Anyways, lets get started! We start off with the standard RGB to YCbCr conversion, but this does not involve the sub-sampling of the Chroma components the reason for which is discussed later.

These layers may then be Tiled (not compulsory). This tiling makes the decoder require lesser memory, but the usage of multiple smaller tiles may then show up blocking artifacts like the JPEG compression method. Generally the whole image is processed as it is.

These tiles are then individually DC Level shifted (mean-shift) to ensure a average sum of 0. The next step, carries out multiple levels of the Wavelet transform. Three to five levels of transform are generally applied.

Each level of the wavelet transform on 2-D data generates an approximation coefficient (left-top) and 3 detail coefficients. In the chroma layers, the first level of detail coefficients are directly ignored and thus we obtain an image similar to a sub-sampled one. The image below gives an insight into the same.

3-Level Haar Wavelet Transform

3-Level Haar Wavelet Transform

Once we have all the detail and approximation coefficients, a global quantization may be performed which puts a high percentage of numbers to zero. Next, a general divide by magnitude quantization is carried out. All these quantization operations are the main reasons for LOSSY compression. Yet, JPEG2000 has a feature to have the quantization step size set at 1.0 on the usage of 5/3 Wavelets which gives a lossless compression.

Now these image tiles are divided into sub-bands and sub-bands into precincts, followed by precincts into code-blocks. These code blocks are coded using a bit-plane coding technique called EBCOT (Embedded Block Coding with Optimal Truncation).

Bits obtained from the EBCOT are called packets and passed through the binary MQ-Coder. These are finally stored with markers and in layers to enable a multi-resolution effect (similar to that of PNG’s progressive loading of layers).

More on Wavelets and the Coding Procedure in posts to come.