From Real to Complex to Vector Gaussians

November 13, 2009
Gaussian distributions are probably the most widely used distributions in mathematics and engineering. Though one strange aspect of them is that they have slightly different equations for the real and complex cases. This is an interesting problem, and has a really nice reasoning, which generally is not taught in classes. In an attempt to answer the why here goes…

All of us have definitely at some point of time seen this familiar real Gaussian distribution
p_x(x) = \frac{1}{\sqrt{2\pi \sigma^2}} e^{-\frac{x^2}{2\sigma^2}}

which changes to a real vector form as
p_x(\mathbf{x}) = \frac{1}{\sqrt{(2\pi)^N |\mathbf{C_x}|}} e^{-\frac{1}{2}\mathbf{x}^T \mathbf{C_x}^{-1}\mathbf{x}}

where x is a vector of Gaussian distributions x_1, x_2, \ldots, x_N with 0 mean and covariance matrix \mathbf{C_x}. This can be written directly under the assumption that the individual x_i are uncorrelated, and since uncorrelated gaussians are independent, a multiplication of the individual gaussian distributions can be carried out.

Further, there is a change when referring to complex distributions. This is due to the difference in the definition of the Covariance matrix itself, which can be now written as a covariance matrix of the real and imaginary parts \mathbf{C_z} or the covariance matrix generated by using the complex number directly as \mathbf{C_s} where \mathbf{z} = [\mathbf{x y}]^T and \mathbf{u} =  [\mathbf{u u^*}]^T are both column vectors of 2N size.

This gives a relation between \mathbf{C_z} = \frac{1}{2} \mathbf{T}^H \mathbf{C_s} \mathbf{T} where \mathbf{T} is a 2-D unitary matrix. This half factor is responsible for the disappearance of the 2 in the fraction, and |\mathbf{C_s}| = |\mathbf{C_u}|^2 is responsible for the removal of the square root. Thus the final Complex Guassian Multivariate distribution ends up (different from the real one) as

p_u(\mathbf{u}) = \frac{1}{(\pi)^N |\mathbf{C_u}|} e^{-\mathbf{u}^T \mathbf{C_u}^{-1}\mathbf{u}}

when \mathbf{u} is circulant complex random variable, i.e. the covariance of real and imaginary parts is same, and they are uncorrelated (generally satisfied in applications).

One last point, its important to note that independence implies uncorrelation, but uncorrelation need not imply independence except for the nice and widely used Gaussian case :)


Classification and Regression Trees – CART

October 29, 2009
One often finds that given a set of 2-class mixed data, a common problem is to separate them. Ofcourse, there are a huge number of optimal and sub-optimal ways to achieve this, but this post shall talk only about the CART method. The algorithm is pretty simple to understand, implement and its working complexity is low. Note that it can be actually used to separate data of any number of classes, but let us consider the simplest case (2-class problem).

We start with a set of questions we can ask the data so as to classify them. In a 2-D dataset (x,y), they could be as simple as x>0 or y>0 and so on. The obvious problem to do this manually is how would one select the questions, and in which order (note that, the solution is like an incomplete binary tree) do we ask them so as to be able to classify the fastest with a high accuracy.

Multiple splitting criteria are available, some of which could be minimum combined entropy of the child nodes, the minimum global entropy, or maximum likelihood, etc. But the concept in all is the same. In every step, we choose a question which gives 2 classes of which atleast 1 contains the dataset of mainly one class (this is the low entropy case mathematically). The ideal case is when it contains data of only one class, where we have 0 entropy.

It can be seen that the global entropy (of all leaf nodes) reduces as we increase the number of splits. However there has to be a limit, and the stopping criteria are defined pretty intuitively. They could be a fixed number of leaf nodes, or a threshold global entropy, or a threshold change in global entropy, or just based on how many splits are to be performed.

This technique is often used in speech recognition systems which suffer from insufficient training of triphones. So we cluster the triphones based on their linguistic properties like voiced/unvoiced/plosive/fricative/etc. (our questions in this case) Thus a similar model can be used for each of the triphones (around 6000 to cover all) which land in the same node after a CART analysis. This has been found to reduce the relative error by around 15%.


Two-Way Mismatch – A Pitch Detection Algorithm

October 12, 2009
Pitch Detection (F0 frequency) has been a big problem for many many years now. One of the solutions, proposed way back in 1993 was to use a Two-Way Mismatch algorithm. Its an excellent idea, and works much better and faster than the standard or even modified autocorrelation technique. The proposed method specifically works well for pitch tracking in music signals.

The working is extremely simple. Small segments of the time domain signal are analysed and their spectrum computed. A peak detection algorithm is run on this which obtains the amplitudes and frequencies (bins) of the strong harmonics seen in the spectrum. The important parameter here is the FFT size and the window length. This is because the frequency resolution and time resolution are dependent on this factor, as described in this earlier post on Formants and Harmonics.

Now, the pitch variation for that music signal or speech signal is generally known. Hence, a range of pitch values are chosen, and their harmonics are computed. These harmonics are are tried to match with the measured ones of the signal. The closer the match, the better the estimation of the pitch. This matching technique ensures that if a few peaks are missed in the signal, they will be omitted and penalised accordingly. Also extra peaks will be penalised.

This matching technique is performed in both ways. Matching the predicted harmonics with the measured partials and the other way round, measured partials with predicted harmonics. A weighting of these two errors is carried out, and the final Pitch estimate error is obtained. The global minimum is computed among the chosen range of pitch values, and this corresponds to the pitch frequency of that segment.

A smoothing (median filtering) may be carried out later in case of speech signals which generally tend to have a constant-pitch over small segments of time.


MS Word – Matlab Interface

October 2, 2009
This is what we use in our labs here at UPC. An interesting feature of Matlab to be able to insert code like text in Microsoft Word, and then execute it.

The idea is like this. Given a problem statement or objective, write your code in the word document, and execute it by pressing Ctrl+Enter. You see the output (of course as long as you dont have ; at the end of command) in blue, and the input in green. This helps in generating answer scripts automatically and there is no need to be wasting time in generating lab session reports! In the screenshot below, you see the things mentioned.

Matlab and Word interface - notebook

Matlab and Word interface - notebook

Yes, the picture is bad quality :( so I expect you would want to try it out yourself. Note that, you have the variables used in the document, now available in the Matlab variables list. You can do all your testing in the Matlab window itself, to simplify and check the working of your code or you can use the editor and write m-files for functions, which can then be invoked in the word document. All the standard stuff can be done. Images can be inserted by copying the Figure window and just pasting it back in the appropriate place in the Word document. Newer versions of Matlab do this automatically! :) (I don’t know how new though).

Starting from scratch: While creating the documents, you would just say notebook('filename'). The command is essentially notebook and this will open a word document. Once you are done setting all the questions, and giving hint codes etc. you can save and quit both Matlab and Word.

More Coding: To study and code again, the next time you open this document it will automatically start Matlab, thereby starting the automation server. The unfilled, blank copy can be given to students in document form, to complete and then just post back the doc file which contains the code, and the results. Its perfect for lab sessions, and maybe even for exams.

Found this to be an interesting feature :) which was never used before.

Other features include that you can be in the folder (or add to path) the common functions that you have created and need to use.


Sound Converter

September 25, 2009
I wanted to convert some large mp3 files to wav for processing yesterday, and here’s what I found. Its extremely good. The name is Sound Converter and is available on Ubuntu by doing just a sudo apt-get install soundconverter.

This simple tool supports MP3, WAV, OGG, FLAC and the AAC formats, which are the most commonly used audio formats. So next time you want to convert a file, and its huge enough making the use of http://media-convert.com/ tough, use this :)

Thats it.. small post this one.


p-Norm and Unit Circle Pics

September 18, 2009
This was just out of curiosity and now it seems like a nice way to prove that any vector’s unit circle \infty-norm is a square, which is nothing but the maximum of all the scalars in that vector.

Well, this is mainly because am undergoing a course on Matrix Algebra right now, and wanted to experiment using Matlab somehow! :) So, a norm or more generally the p-norm for a vector is basically

\|x\|_p = (|x_1|^p + |x_2|^p + ... + |x_N|^p)^{1/p}

Yes, and am bloody glad that I can use \LaTeX code in WordPress :) So nice of them. Here’s how its done. Just insert the lines $ latex \|x\|_p = (|x_1|^p + |x_2|^p + ... + |x_N|^p)^{1/p}$ in your required section. No space is to be put between $ and latex, I couldn’t show without it though as it was converting to \LaTeX.

Anyways, so to find the unit circle (only 1 quadrant) I just did a scatter plot of the random number vectors whose norm was less than 1, and it looks like this.

Unit Circles of Norm - Matlab Simulations

Unit Circles of Norm - Matlab Simulations

This will let you know that the unit-circle went from a triangle like shape, to circle and then is bulging towards the (1,1) point. It reaches this only at \infty. Note that this is just in the positive quadrant, but its symmetrical in all quadrants of the 2-D space considered here.


Radon Transform and Tomography

September 11, 2009
The Radon transform is widely being used in a whole lot of image processing applications. Its use at detecting lines in noisy images is extremely powerful. So what exactly is this Radon Transform?

Take an image, and take its horizontal projection (sum along each row at 0 degrees). Now, rotate the image and take another projection, and so on, take projections for various angles. What one obtains is a matrix – projection columns vs varying angles.

In Matlab, this can be done using [RadonImage, k]=radon(input_image,theta) where theta are the angles over which the projection is taken and k the line number (perpendicular distance from center of image of each line) It can be visualised nicely by imagesc(theta,k,RadonImage).

Now, Tomography is an imaging method by sectioning. CTScans (computed tomography scan), MRIs, and many other medical, oceanographic, geophysical, etc. imaging carried out in this way. For now, we shall restrict ourselves to a 2d image, and its single dimensional projections. Once these projections are obtained, a backprojection can be carried out to regenerate the 2d image. The algorithm used here, is of main concern, as we want our algorithm to be efficient both in time and complexity and require lesser number of projections for reconstruction.

One of these is the Filtered Back-Projection which makes use of the concept of Fourier Slice Transform. This tells us that, the 1d Fourier Transform of the projection, is equal to the 2d Fourier Transform of the image evaluated on the line whose projection was used earlier.

Further, in the mathematics its seen that, there comes a transformation from rectangular to polar coordinates which introduces a determinant of Jacobian. This can be multiplied with the 1d transforms of projections directly to obtain something called filtered back-projection. These back-projections are the slow development of the original image. More and more back-projections help in the reconstruction of the original image. Basically, “given the projection at a specific angle, we could reconstruct the image something like this” is what the method says.

To the user’s concern, Matlab offers a function called iradon. It can be called as Reconstructed_image = iradon(RadonImage, theta). Note that the theta used here should be the same ones used while computing the Radon Transform.

For a better mathematical treatment, I would refer this post to this Rice Univ. page.


Time and Frequency – Formants and Harmonics

August 30, 2009
This post shall initially talk about the time-frequency uncertainty (yes, its like Heisenberg’s position-momentum uncertainty principle). Further, we shall look at Formants and Harmonics and the corresponding wideband and narrowband spectrograms.

So, what exactly is the time-frequency problem. In the generation of a spectrogram, its a practice to take a window of a certain size, and then take that time duration signal’s FFT. Thus it generates a time-frequency graph. Now, the size of the window is very important to note. In simple steps

Long window – Bad time resolution – Good frequency resolution – Narrowband spectrogram – Harmonics
Short window – Good time resolution – Bad frequency resolution – Wideband spectrogram – Formants

What do we exactly mean by time and frequency resolution. Time resolution is how well we can point out to a section of the audio and say that this analysed spectrum belongs to this particular part. On the other hand, frequency resolution is saying that this frequency component accurate to a few Hz is part of the windowed speech sample.

So, the wideband spectrogram, shows clear formants (5ms window). But decreasing the time window too much, smears the frequency graph too much making it tough to detect formants. Formants are basically acoustic resonances of the human vocal tract. Our vocal tract continuously changes shape to produce all the sounds that we make. These formants can be used alone to classify all the vowels (essentially voiced phonetics) that we produce. Generally, the F1 and F2 are used to do this. Vowels are found to generally have around 4 formants. There are some other factors like plosives close to vowels, which lower the formant frequencies. Below is a picture of the formants.

Formant Structure - seen in red

Formant Structure - seen in red

Now coming to harmonics, these are nothing but multiples of the fundamental frequency (also called F0) – the pitch. These are seen in narrow-band spectrograms (30ms window). Harmonics have a role to play in speaker recognition and speech recognition too. Below is a picture clearly showing the harmonics in speech.

Harmonics - Bands with darker regions of formants

Harmonics - Bands with darker regions of formants

These images were generated using a software called Praat. May write a post on that sometime.


FLAC – Lossless Audio

August 24, 2009
The Free Lossless Audio Codec (FLAC), is a file format specially meant for audio which gives a lossless output (i.e. Perfect reconstruction can be achieved). Well, after MP3, its debatable whether a perfectly lossless file is actually required, but technology development still happens :) So, since its specially meant for audio, its compression factor is more than the standard lossless file archival formats like zip / rar.

So what exactly makes this FLAC lossless, yet smaller than the other techniques. The encoding procedure can be understood in simple small steps. Firstly, a blocking of the samples is carried out. Each such encoded block is packed into a frame with its headers and footers consisting of CRCs and other stuff. The blocking size is a crucial criteria, as a small size would mean a higher frame data overhead, and a large blocking size would mean a badly fitting model. By default a block size of 4096 samples is chosen for the 44.1kHz sample rate.

In the next step, the encoder tries to approximate this signal whose residual would then require lesser bits to encode. The two main ways of modelling are either by fitting simple polynomials to the signal or by using linear predictive coding (LPC). The model parameters form the information to be stored. Ofcourse LPC requires more bits than polynomials, but generally tends to give more accurate models.

The LPC is a model of our human speech production system. In extremely simple words, it assumes the voice production to be mainly a stream of impulses passing through our throat and sound production system, a response function, and LPC tries to model this very function which can give a close approximation of the audio sample’s spectrum.

The model is now generated, and residual coding is carried out on the left over (signal – model) residual. This is generally broken into several partitions, and is found to have a Laplacian distribution, on which a special type of Huffman code, called the Rice code is applied for higher efficiency. Each such partition has a Rice parameter which is adjusted for that small distribution.

Thus the frame is then packed, called the framing operation with a header containing crucial information like sample rate, bits per sample, etc. The data then follows and is closed by a CRC of the entire encoded frame for error detection.

The main point to note here is unlike most other compression procedures there is no quantisation operation. This is what makes this system entirely lossless, as all the above described procedures are perfectly reversible.


Our Amazing Ears and Brain – Cocktail Party Effect

August 21, 2009
I was reading about this article and found it to be really interesting. So what is the Cocktail Party Effect!

Say we are in a crowded and noisy place (a party for example) where lots of people are talking and someone is talking to us too. Even in such surroundings, we can clearly talk and listen with a specific person who we want to talk with. So we are in a way rejecting a lot of unwanted noise that is present in the surroundings. Yet, if our name is called out by someone somewhere, we hear that and shift our response towards that direction.. and all this happens automatically :) This brings to question, are we actually listening to all that junk and checking whether its relevant to us ?

Another beautiful example is the “what-did-you-say” phenomenon. When someone asks us a question, and we are not attentive often our reaction is “uh, what did you say?” But then, before the question is repeated, we manage to comprehend what was told to us a few seconds ago from memory. This experiment was actually tried in laboratory, the results agreed with the above which confirm: there is a temporary memory for sounds to which we are not attending, but this is not a long-term memory.

On the other hand, our speech recognition systems today, fail miserably on only having noise in the to-be-recognized speech. Forget multiple people talking and such reality issues. In view of this amazing capability of the brain, the research is on to provide what they call it Auditory Scene Analysis. Put as another example, a baby, imitates his mother’s voice, but does not insert the cradle squeaks that have occurred simultaneously when the mother was speaking with him. The baby successfully rejects the squeaks as not being part of the perceptual object formed by the mother’s voice. This probably is just a way of saying that our systems today are not even at par to babies :( and looking at the bright side; well, there’s a lot of work to do :)

So some of the ideas developed so far are to use the spatial location and continuity of the sound source, difference in loudness, visual channel cues which may be influencing our brain to a high extent. We know for certain that a dog cannot talk like a human, and thus we have done some sense of basic scene analysis, but the computer is yet to understand that. In recent days, Harmonics and Frequency Modulation, the Pitch Trajectory, are seen to provide much needed understanding into auditory scene analysis. Visual cues and Auditory source location and continuity together seem to be promising too.

These has varying applications from multi-user conference systems, where multiple people could speak together, to our own handhelds which are now losing their keypads (Nokia 5800). Conversing with a mobile should ensure that only our voice is being heard and separated else a shut-down said somewhere far away could cause problems :)

A summary of these ideas will be to provide as much continuity within a foreground stream as possible, while making them as differentiable from other background streams as is practical, without adding so many effects that the required data becomes messy.