With the increasing popularity of video, approaches are being explored to convert speech to text so that textual techniques can be applied. Interestingly, most speech-to-text systems work very well when the speech carries minimal emotion. So our search engines want to deal with textual words, not spoken words.
Let’s consider the world of images, video, audio, and other signals. In fact, let’s consider the world itself and see what role words play in it. An image represents a frozen visual perspective of the real world. Similarly, an audio signal represents a subset of the world, captured at one place using one type of signal. And video is a perspective of the world from one point, in synchronized visual and auditory form. Is there an analog of words in these signals? Can there be? This is an interesting question to consider.
Let’s revisit words in text. Words are formed using letters. But what came first, words or letters? If you focus narrowly on textual words, you will start believing that letters came first and that by combining some letters – like w, o, r, d, and s – one could form words. But a little thought tells us that this could not be the case. Why would anybody in their right mind develop letters if words were not already there? So there must have been words before people thought of inventing letters! It is interesting to think about why the same three letters – d, g, and o – result in words as different as dog and god. Immediately an interesting question comes to mind: can we use a ‘bag of letters’ in search? The answer seems obvious: no.
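To make that concrete, here is a minimal Python sketch. The helper name bag_of_letters is made up for illustration and is not a standard search technique. Once letter order is thrown away, dog and god become the same multiset of letters, so a bag of letters cannot tell them apart.

```python
from collections import Counter


def bag_of_letters(word: str) -> Counter:
    """Count each letter in the word, ignoring order and case."""
    return Counter(word.lower())


# Same letters, different order: the two bags are identical.
print(bag_of_letters("dog") == bag_of_letters("god"))

# The words themselves are still different strings.
print("dog" == "god")
```

The first comparison is true and the second is false: everything that separates the two words lives in the order of the letters, which the bag discards.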
Does that mean that the ‘word’ is the atomic unit of semantics? And does that make a ‘bag of words’ (BoW) meaningful? Let’s consider the following two sentences:
A: A dog bites a man.
B: A man bites a dog.
Both of these are equivalent in terms of BoW, since they contain exactly the same words the same number of times. But to most people these are very different sentences that convey very different situations. And we learn from early childhood that the juxtaposition of words is what produces semantics. Does that mean that BoW is useless for semantics?
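The equivalence of sentences A and B under BoW can be checked with a short Python sketch. The tokenization here (lowercasing, dropping the final period, splitting on whitespace) is a simplifying assumption, and the helper names tokenize and bag_of_words are hypothetical; real search engines use far more elaborate pipelines.

```python
from collections import Counter


def tokenize(sentence: str) -> list[str]:
    """Lowercase, drop the trailing period, and split on whitespace."""
    return sentence.lower().rstrip(".").split()


def bag_of_words(sentence: str) -> Counter:
    """Count each word, discarding the order in which words appear."""
    return Counter(tokenize(sentence))


a = "A dog bites a man."
b = "A man bites a dog."

# The bags are identical: {'a': 2, 'dog': 1, 'bites': 1, 'man': 1}.
print(bag_of_words(a) == bag_of_words(b))

# The ordered token sequences are not.
print(tokenize(a) == tokenize(b))
```

The first comparison is true and the second is false. Who bites whom is carried entirely by word order, which is exactly the information BoW throws away.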