So, what exactly is the time-frequency problem? To generate a spectrogram, the usual practice is to take a window of a certain size and compute the FFT of the signal within that window. Sliding the window along the signal and repeating this yields a time-frequency graph. The size of the window matters a great deal. In short:
Long window – Bad time resolution – Good frequency resolution – Narrowband spectrogram – Harmonics
Short window – Good time resolution – Bad frequency resolution – Wideband spectrogram – Formants
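The window-then-FFT procedure can be sketched in a few lines of Python. This is only a minimal illustration, assuming NumPy is available; the function name `spectrogram`, the Hann window, the 2.5 ms hop, the 16 kHz sample rate and the 440 Hz test tone are all my own choices for the example, not anything Praat does internally.

```python
import numpy as np


def spectrogram(signal, sample_rate, window_ms, hop_ms):
    """Magnitude spectrogram: slice the signal into frames, window each, FFT each.

    Returns an array of shape (n_frames, n_frequency_bins).
    """
    win_len = int(round(sample_rate * window_ms / 1000.0))
    hop_len = int(round(sample_rate * hop_ms / 1000.0))
    window = np.hanning(win_len)  # assumed window shape; others work too

    # Number of full frames that fit in the signal.
    n_frames = 1 + (len(signal) - win_len) // hop_len
    frames = np.stack([
        signal[i * hop_len: i * hop_len + win_len] * window
        for i in range(n_frames)
    ])
    # rfft keeps only the non-negative frequencies of a real signal.
    return np.abs(np.fft.rfft(frames, axis=1))


# Hypothetical test signal: one second of a 440 Hz sine at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
signal = np.sin(2 * np.pi * 440 * t)

wide = spectrogram(signal, sample_rate, window_ms=5.0, hop_ms=2.5)
narrow = spectrogram(signal, sample_rate, window_ms=30.0, hop_ms=2.5)

# Short window: few frequency bins per frame. Long window: many.
print("wideband   (5 ms): ", wide.shape)
print("narrowband (30 ms):", narrow.shape)
```

With the same hop, the 5 ms window gives far fewer frequency bins per frame than the 30 ms window, which is the trade-off listed above seen from the frequency side.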
What exactly do we mean by time and frequency resolution? Time resolution is how precisely we can point to a section of the audio and say that the analysed spectrum belongs to that particular part. Frequency resolution, on the other hand, is how precisely, to within a few Hz, we can say that a given frequency component is present in the windowed speech sample.
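To put rough numbers on this, a common rule of thumb is that a window of duration T seconds cannot separate frequencies much closer than about 1/T Hz. The sketch below just does that arithmetic. The helper name is hypothetical, and the exact figure depends on the window shape, so treat the result as an order of magnitude rather than a precise limit.

```python
def approx_frequency_resolution_hz(window_ms: float) -> float:
    """Rule-of-thumb frequency resolution (about 1/T) for a window of window_ms."""
    if window_ms <= 0:
        raise ValueError("window length must be positive")
    return 1000.0 / window_ms


for label, window_ms in (("short", 5.0), ("long", 30.0)):
    resolution = approx_frequency_resolution_hz(window_ms)
    print(f"{label} window: time resolution ~{window_ms:g} ms, "
          f"frequency resolution ~{resolution:.1f} Hz")
```

Shrinking the window improves one number only by making the other worse, since their product stays fixed.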
So the wideband spectrogram (5 ms window) shows clear formants. But shrinking the time window too far smears the frequency axis so much that formants become hard to detect. Formants are the acoustic resonances of the human vocal tract, which continuously changes shape to produce all the sounds that we make. The formants alone can be used to classify the vowels (essentially voiced sounds) that we produce; generally F1 and F2 are used for this. Vowels are generally found to have around 4 formants. Other factors also play a part, such as plosives close to vowels, which lower the formant frequencies. Below is a picture of the formants.
Now coming to harmonics: these are simply integer multiples of the fundamental frequency (also called F0), which we perceive as pitch. They are seen in narrowband spectrograms (30 ms window). Harmonics have a role to play in speaker recognition and in speech recognition too. Below is a picture clearly showing the harmonics in speech.
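A small sketch can show why harmonics need the longer window. Harmonics sit F0 apart, so they only show up as separate lines when the frequency resolution is finer than F0. The F0 of 100 Hz below is an assumed illustrative value, the function names are hypothetical, and the 1/T resolution estimate is only a rule of thumb.

```python
def harmonic_frequencies(f0_hz: float, count: int) -> list:
    """First `count` harmonics: integer multiples of the fundamental F0."""
    return [f0_hz * k for k in range(1, count + 1)]


def harmonics_resolved(f0_hz: float, window_ms: float) -> bool:
    """Rough check: harmonics are F0 apart, so they separate only if the
    rule-of-thumb resolution (about 1/T) is finer than F0."""
    return 1000.0 / window_ms < f0_hz


f0 = 100.0  # assumed F0 in Hz, for illustration only
print("harmonics:", harmonic_frequencies(f0, 4))
for window_ms in (5.0, 30.0):
    verdict = "resolved" if harmonics_resolved(f0, window_ms) else "smeared together"
    print(f"{window_ms:g} ms window: harmonics {verdict}")
```

Under these assumptions the 30 ms window separates the harmonics while the 5 ms window blurs them together, leaving only the broad formant peaks visible.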
These images were generated using a program called Praat. I may write a post on that sometime.