Say we are in a crowded and noisy place (a party, for example) where lots of people are talking, and someone is talking to us too. Even in such surroundings, we can clearly converse with the one specific person we want to talk to. So we are, in a way, rejecting a lot of unwanted noise present around us. Yet if our name is called out by someone somewhere, we hear it and shift our attention in that direction... and all this happens automatically 🙂 This raises the question: are we actually listening to all that junk and checking whether it's relevant to us?
Another beautiful example is the “what-did-you-say” phenomenon. When someone asks us a question while we are not attentive, our reaction is often “uh, what did you say?” But then, before the question is even repeated, we manage to retrieve from memory what was said to us a few seconds ago. This has actually been tested in the laboratory, and the results confirm it: there is a temporary memory for sounds we are not attending to, but it is not a long-term memory.
On the other hand, our speech recognition systems today fail miserably with just noise added to the to-be-recognized speech; forget multiple people talking and other such real-world conditions. In view of this amazing capability of the brain, research is on in a field called Auditory Scene Analysis. To take another example: a baby imitates his mother’s voice but does not insert the cradle squeaks that occurred while the mother was speaking with him. The baby successfully rejects the squeaks as not being part of the perceptual object formed by the mother’s voice. This is probably just a way of saying that our systems today are not even on par with babies 😦 and, looking at the bright side, well, there’s a lot of work to do 🙂
So some of the ideas developed so far are to use the spatial location and continuity of the sound source, differences in loudness, and cues from the visual channel, which may be influencing our brain to a large extent. We know for certain that a dog cannot talk like a human, and so we have already done some basic scene analysis, but a computer is yet to figure that out. More recently, harmonics, frequency modulation, and the pitch trajectory have been shown to provide much-needed insight into auditory scene analysis. Visual cues combined with auditory source location and continuity seem promising too.
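To make the harmonicity idea a bit more concrete, here is a minimal sketch in Python with NumPy (the two fundamentals, the harmonic counts, and the matching tolerance are made-up values for illustration, not from any real system): two simulated harmonic "voices" are mixed, spectral peaks are picked from the mixture, and each peak is assigned to whichever fundamental's harmonic series it fits best.

```python
import numpy as np

FS = 8000                    # sample rate (Hz)
DUR = 1.0                    # duration in seconds (gives 1 Hz bin resolution)
F0_A, F0_B = 200.0, 310.0    # hypothetical fundamentals of the two "voices"

t = np.arange(int(FS * DUR)) / FS

def harmonic_complex(f0, n_harmonics=3):
    """A crude harmonic source: f0 plus its first few harmonics."""
    return sum(np.sin(2 * np.pi * f0 * k * t) for k in range(1, n_harmonics + 1))

# The "cocktail party": both sources speaking at once.
mixture = harmonic_complex(F0_A) + harmonic_complex(F0_B)

# Magnitude spectrum of the mixture.
spectrum = np.abs(np.fft.rfft(mixture))
freqs = np.fft.rfftfreq(len(mixture), d=1.0 / FS)

# Pick spectral peaks: bins that dominate their neighbours and carry real energy.
threshold = spectrum.max() * 0.1
peaks = [freqs[i] for i in range(1, len(spectrum) - 1)
         if spectrum[i] > threshold
         and spectrum[i] >= spectrum[i - 1]
         and spectrum[i] >= spectrum[i + 1]]

def group_by_harmonicity(peak_freqs, candidate_f0s, tol=10.0):
    """Assign each peak to the candidate f0 whose harmonic series it best fits."""
    groups = {f0: [] for f0 in candidate_f0s}
    for p in peak_freqs:
        # Distance from p to the nearest integer multiple of each candidate f0.
        best = min(candidate_f0s, key=lambda f0: abs(p - f0 * round(p / f0)))
        if abs(p - best * round(p / best)) < tol:
            groups[best].append(p)
    return groups

groups = group_by_harmonicity(peaks, [F0_A, F0_B])
print({f0: [round(p) for p in ps] for f0, ps in groups.items()})
```

With these toy parameters the peaks at 200, 400, and 600 Hz land in one group and those at 310, 620, and 930 Hz in the other, i.e. each harmonic stack is recovered as one "stream". Real auditory scene analysis of course also has to handle overlapping harmonics, inharmonic sounds, and pitch that moves over time, which is exactly where the pitch-trajectory and continuity cues above come in.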
These ideas have varying applications, from multi-user conference systems where several people may speak at once, to our own handhelds, which are now losing their keypads (Nokia 5800). Conversing with a mobile should ensure that only our voice is separated out and heard; otherwise a “shut down” said by someone far away could cause problems 🙂
In summary, these ideas amount to preserving as much continuity within a foreground stream as possible, while making it as distinguishable from the background streams as is practical, without adding so many effects that the required data becomes messy.