# Two-Way Mismatch – A Pitch Detection Algorithm

Pitch detection (estimating the fundamental frequency, F0) has been a hard problem for many years now. One of the solutions, proposed way back in 1993, is the Two-Way Mismatch (TWM) algorithm. It's an excellent idea, and works much better and faster than the standard or even the modified autocorrelation technique. The method works particularly well for pitch tracking in music signals.

The working is extremely simple. The time-domain signal is split into short segments and the spectrum of each segment is computed. A peak detection algorithm is then run on the spectrum to obtain the amplitudes and frequencies (bins) of the strong partials. The important parameters here are the FFT size and the window length, because the frequency resolution and time resolution depend on them, as described in this earlier post on Formants and Harmonics.

Now, the pitch range of a given music or speech signal is generally known. Hence, a range of candidate pitch values is chosen, and the harmonics of each candidate are computed. These predicted harmonics are then matched against the partials measured in the signal. The closer the match, the better the estimate of the pitch. The matching is set up so that a predicted harmonic with no measured peak nearby (a missed peak) is penalised, and extra measured peaks that fit no harmonic are penalised as well.

The matching is performed both ways: the predicted harmonics are matched to the measured partials, and the measured partials are matched to the predicted harmonics. The two errors are combined with a weighting to give the total error for each pitch candidate. The candidate with the global minimum error over the chosen range is taken as the pitch frequency of that segment.
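The two-way matching described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `twm_error` and `estimate_pitch` are made up for this example, the spectral peaks are assumed to be already detected, and the weighting constants `p`, `q`, `r` and `rho` are the values commonly quoted for the algorithm, so treat them as assumptions.

```python
import math


def twm_error(f0, peak_freqs, peak_amps, p=0.5, q=1.4, r=0.5, rho=0.33):
    """Total two-way mismatch error for one candidate pitch f0 (Hz).

    peak_freqs / peak_amps are the measured spectral peaks. The constants
    p, q, r, rho are assumed weighting values, not taken from the post.
    """
    a_max = max(peak_amps)
    # Predict harmonics of f0 up to the highest measured peak.
    n_harmonics = max(1, math.ceil(max(peak_freqs) / f0))
    harmonics = [n * f0 for n in range(1, n_harmonics + 1)]

    # Predicted -> measured: each harmonic looks for its nearest measured peak.
    err_pm = 0.0
    for fh in harmonics:
        k = min(range(len(peak_freqs)), key=lambda i: abs(peak_freqs[i] - fh))
        weighted_df = abs(peak_freqs[k] - fh) * fh ** -p
        err_pm += weighted_df + (peak_amps[k] / a_max) * (q * weighted_df - r)

    # Measured -> predicted: each peak looks for its nearest harmonic of f0.
    err_mp = 0.0
    for fk, ak in zip(peak_freqs, peak_amps):
        n = max(1, round(fk / f0))
        weighted_df = abs(fk - n * f0) * fk ** -p
        err_mp += weighted_df + (ak / a_max) * (q * weighted_df - r)

    # Weighted combination of the two normalised errors.
    return err_pm / len(harmonics) + rho * err_mp / len(peak_freqs)


def estimate_pitch(peak_freqs, peak_amps, candidates):
    """Return the candidate pitch with the smallest total error."""
    return min(candidates, key=lambda f0: twm_error(f0, peak_freqs, peak_amps))


# Synthetic peaks of a 220 Hz tone (made-up amplitudes).
peak_freqs = [220.0, 440.0, 660.0, 880.0]
peak_amps = [1.0, 0.8, 0.6, 0.4]
estimate = estimate_pitch(peak_freqs, peak_amps, range(100, 501))
print(estimate)
```

For this synthetic input the search settles on 220 Hz. The candidate at 110 Hz loses because its odd harmonics find no measured peak, and the candidate at 440 Hz loses because the measured peaks at 220 Hz and 660 Hz fit none of its harmonics, which is exactly why the error is computed in both directions.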

A smoothing step (median filtering) may be applied afterwards for speech signals, which generally tend to have a nearly constant pitch over short segments of time.

# Time and Frequency – Formants and Harmonics

This post first talks about the time-frequency uncertainty (yes, it's like Heisenberg's position-momentum uncertainty principle). After that, we shall look at formants and harmonics and the corresponding wideband and narrowband spectrograms.

So, what exactly is the time-frequency problem? To generate a spectrogram, the usual practice is to take a window of a certain size and compute the FFT of the signal inside that time span. Repeating this as the window slides along the signal gives a time-frequency graph. The size of the window is the important thing to note here. In short:

• Long window – Bad time resolution – Good frequency resolution – Narrowband spectrogram – Harmonics
• Short window – Good time resolution – Bad frequency resolution – Wideband spectrogram – Formants

What exactly do we mean by time and frequency resolution? Time resolution is how well we can point to a section of the audio and say that the analysed spectrum belongs to that particular part. Frequency resolution, on the other hand, is how precisely (to within how many Hz) we can say that a given frequency component is present in the windowed speech sample.
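As a rough numerical illustration of this trade-off, the spacing between FFT bins is the sampling rate divided by the number of samples in the window. The sketch below assumes a 16 kHz sampling rate (an arbitrary choice for the example) and an FFT exactly as long as the window; `bin_spacing_hz` is a hypothetical helper name. The usable resolution also depends on the window shape, so these numbers are only a guide.

```python
def bin_spacing_hz(window_seconds, sample_rate):
    """Spacing between FFT bins when the FFT length equals the window length."""
    n_samples = round(window_seconds * sample_rate)
    return sample_rate / n_samples


sample_rate = 16000  # assumed sampling rate in Hz, chosen only for illustration

short_window = bin_spacing_hz(0.005, sample_rate)  # 5 ms window, 80 samples
long_window = bin_spacing_hz(0.030, sample_rate)   # 30 ms window, 480 samples

print(short_window)  # 200 Hz between bins
print(long_window)   # about 33.3 Hz between bins
```

With bins 200 Hz apart, the harmonics of a typical speaking voice (spaced by a pitch of roughly 100 to 250 Hz) cannot be separated, so only the broad formant envelope remains. With bins about 33 Hz apart, the individual harmonics show up as separate lines.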

So, the wideband spectrogram (5 ms window) shows clear formants. But shrinking the time window too much smears the spectrum along the frequency axis, making the formants hard to detect. Formants are basically the acoustic resonances of the human vocal tract. Our vocal tract continuously changes shape to produce all the sounds that we make. The formants alone can be used to classify the vowels (essentially voiced sounds) that we produce; generally F1 and F2 are used for this. Vowels are generally found to have around 4 formants. Other factors also play a part, such as plosives close to vowels, which lower the formant frequencies. Below is a picture of the formants.

Formant Structure - seen in red

Now coming to harmonics: these are nothing but integer multiples of the fundamental frequency (also called F0), the pitch. They are seen in narrowband spectrograms (30 ms window). Harmonics have a role to play in speaker recognition and speech recognition too. Below is a picture clearly showing the harmonics in speech.

Harmonics - Bands with darker regions of formants

These images were generated using a software package called Praat. I may write a post on that sometime.

# Why 44.1kHz and not 40kHz for Audio

So, this was one of the questions asked during my Nvidia interview (when I said I was interested in multimedia): why use a 44.1 kHz sampling rate rather than 40 kHz, when most humans can't hear above 18.5 kHz and certainly not above 20 kHz? Well, I didn't know the answer then, but here it is now. It's mostly quoted from John Watkinson's book, The Art of Digital Audio.

So basically, “In the early days of digital audio research, the necessary bandwidth of about 1 Mbps per audio channel was difficult to store. Disk drives had the bandwidth but not the capacity for long recording time, so attention turned to video recorders. These were adapted to store audio samples by creating a pseudo-video waveform which would convey binary as black and white levels. The sampling rate of such a system is constrained to relate simply to the field rate and field structure of the television standard (NTSC / PAL – note the refresh rates used ahead), so that an integer number of samples can be stored on each usable TV line in the field. Such a recording can be made on a monochrome recorder, and these recordings are made in two standards, 525 lines at 60 Hz and 625 lines at 50 Hz. Thus it is possible to find a frequency which is a common multiple of the two and is also suitable for use as a sampling rate.

The allowable sampling rates in a pseudo-video system can be deduced by multiplying the field rate by the number of active lines in a field (blanking lines cannot be used) and again by the number of samples in a line. By careful choice of parameters it is possible to use either 525/60 or 625/50 video with a sampling rate of 44.1KHz.”

Aha, so that's the point. Digital audio stored on video recorders was sampled at 44.1 kHz to best suit the video setup, not the audio. But the tradition continues, and even today in the days of MP3 (ripped again from CD / DVD / 44.1 kHz content) we still do not bother to resample the data (something which could harm quality). A few calculations were further explained as follows.

“In 60 Hz video, there are 35 blanked lines, leaving 490 lines per frame or 245 lines per field, so the sampling rate is given by `60 X 245 X 3 = 44.1 kHz`. While, in 50 Hz video, there are 37 lines of blanking, leaving 588 active lines per frame, or 294 per field, so the same sampling rate is given by `50 X 294 X 3 = 44.1 kHz`.

The sampling rate of 44.1 KHz came to be that of the Compact Disc. Even though CD has no video circuitry, the equipment used to make CD masters is video based and determines the sampling rate.”
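The arithmetic in the quote is easy to check. The short sketch below just redoes the multiplication from the quoted line counts; `pseudo_video_rate` is a hypothetical helper name, and the figure of 3 samples per line is taken from the quoted formulas.

```python
def pseudo_video_rate(field_rate_hz, total_lines, blanked_lines, samples_per_line=3):
    """Sampling rate (Hz) of audio stored as pseudo-video, per the quoted formula."""
    active_per_frame = total_lines - blanked_lines
    active_per_field = active_per_frame // 2  # two interlaced fields per frame
    return field_rate_hz * active_per_field * samples_per_line


rate_60hz = pseudo_video_rate(60, 525, 35)  # 525 lines at 60 Hz, 35 blanked
rate_50hz = pseudo_video_rate(50, 625, 37)  # 625 lines at 50 Hz, 37 blanked

print(rate_60hz, rate_50hz)  # 44100 44100
```

Both television standards land on exactly 44,100 samples per second, which is what made the rate convenient for recorders of either kind.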

That's the secret behind the fishy 44.1 kHz sampling rate 🙂

# Audio Manipulation in Matlab

As you would expect, Matlab can work with audio. Saving and opening .wav files, playing them, and recording are the basic operations.

Recording is carried out by creating an object (as in object-oriented programming). The object used for recording is `audiorecorder`.

Obviously the Matlab help covers everything mentioned below, but here is a personal touch to it. You instantiate the object by calling its constructor with a few properties and assigning the result to a name.

`y = audiorecorder(8000, 16, 1);`

is the most commonly used form. It creates the object with a sampling frequency of 8000 Hz, 16 bits per sample and mono (single-channel) recording. A sampling frequency of 8 kHz suits speech and voice well. For music with high-pitched content, a higher sampling frequency such as 22050 Hz or 44.1 kHz is the better choice.

• `record(y);` – starts recording and returns control immediately, so other statements can execute meanwhile (useful)
• `pause(y);` – pauses recording
• `resume(y);` – resumes recording
• `play(y);` – plays back what has been recorded so far
• `signal = getaudiodata(y);` – finally, copies the recorded data into the variable called `signal`

Signal processing applications would generally then check for saturation (clipping), i.e. many of the absolute values of the signal being close to 1, which hampers audibility.

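You would now store this recorded signal to your hard disk with
`wavwrite(signal,8000,16,wavfilename);`
Note that the sampling frequency and the number of bits per sample should be the same as those used for recording. In newer Matlab releases `wavwrite` has been removed, and `audiowrite` is its replacement.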

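These wav files can be read with
`[data, Fs] = wavread(wavfilename);`
Here `data` is the actual audio, with sample values between -1 and 1, and `Fs` is the sampling frequency, a key piece of information for any further processing. In newer Matlab releases `wavread` has been removed, and `audioread` is its replacement.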

This `data` can be played using `wavplay(data, Fs)`, but that is supported only on 32-bit Windows machines. A better alternative is `soundsc(data, Fs)`, which automatically scales the data vector amplitudes from [MIN,MAX] to [-1,1] and then plays the entire data stream at the given sampling rate.

More on audio processing using Matlab to come soon…