We will discuss some key issues that limit current search. My goal is to explore how some of these limitations can be removed to make next-generation search significantly more efficient and effective.
Today I will discuss the use of word frequency as the major deciding factor in the relevance of a document to a search request.
We are all familiar with current search engines. When we need to find information, we go to our favorite search engine and think about which keywords will help us find what we are interested in. Suppose that I am interested in how popular President Bush is. I can type in George Bush and get lots of information about him. If I type in Bush George, interestingly, the sequence of results looks very different. In fact, even the number of entries in the results is different. Now, since I am interested in his popularity, I decide to type in ‘Bush Popularity’ and get a list of results. The results are mostly related to polls and other factors in the United States, but I want to find out what is happening to his popularity in India. I think ‘Bush Popularity India’ should give me the results, and I try that. I do get a list of several tens of thousands of entries (more than 60 thousand in this case), go through the first 20, and find that none of them really has anything to do with what I wanted to find out. Most of the entries appear there because the words ‘Bush’, ‘Popularity’, and ‘India’ somehow appear in the document. They may not have anything to do with Bush’s popularity in India; it may be just that the ‘Times of India’ was reporting on Bush’s popularity in the United States, or was reporting that a popular Indian leader met Bush. All of these will trigger an equally strong response. On the other hand, if an article in the Times of India is critical of Bush’s policies, it will not appear in the list because it did not use the term ‘popular’. Now try finding the distribution of his popularity in different European countries using your favorite search engine.
Keywords are very limited in conveying what we are actually looking for. But if you want to find all documents related to a particular topic, the current methods depend purely on keywords. That is how the databases used by search engines are created. Search engines do use some very powerful linguistic techniques to reduce your keywords to ‘stems’ that capture the basic idea behind them. They also prepare a frequency distribution of the words in each document so that they can use frequency to judge the relevance of the document to the requested search. Lately, ontological filtering is also being used (a good friend, Prof. Amit Sheth, has done some remarkable research in this area and has produced very powerful software that is currently in use by several companies) to find relevant documents among those that contain the keyword or some equivalent word.
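To make the stemming and frequency idea concrete, here is a minimal sketch in Python. It is not how any real search engine is implemented: the suffix list, the function names (stem, term_frequencies, score, matches_all) and the two sample documents are all assumptions made up for illustration, and a production system would use a proper stemmer rather than this toy one.

```python
from collections import Counter
import re

# Toy suffix list (an assumption); real engines use far more careful stemmers.
SUFFIXES = ("ity", "ing", "ed", "s")

def stem(word):
    """Crudely strip one common suffix so 'popularity' and 'popular' match."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def term_frequencies(text):
    """Count how often each stem occurs in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(stem(w) for w in words)

def score(query, document):
    """Sum the frequencies of the query stems found in the document."""
    freqs = term_frequencies(document)
    return sum(freqs[stem(q)] for q in query.split())

def matches_all(query, document):
    """True only if every query stem occurs at least once in the document."""
    freqs = term_frequencies(document)
    return all(freqs[stem(q)] > 0 for q in query.split())

# Hypothetical documents modeled on the example in the text.
doc_us = "Times of India reports on the popularity of Bush in the United States."
doc_critical = "A Times of India editorial is critical of the policies of Bush."

query = "Bush Popularity India"
print(score(query, doc_us), matches_all(query, doc_us))
print(score(query, doc_critical), matches_all(query, doc_critical))
```

In this sketch the first document matches every keyword even though it says nothing about opinion in India, while the editorial that actually reflects Indian opinion is filtered out because it never uses the word ‘popular’.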
Search engines like Google use many factors to decide the relevance of a document and rank the list properly. But all these techniques are based only on words and their frequencies. Even a word placed on a page without any context or sentence increases its frequency, and people do put words in white on a white background to steer search engines in their direction. To judge relevance ranking, many other techniques, such as links from a page and to the page, are also used. But ultimately these are restricted by the fact that the original list starts with the frequency of words.
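The hidden-word trick is easy to see in a small sketch. The scoring function below is an assumption for illustration only (a plain count of query words, with no link analysis or any of the other factors a real engine uses), and both pages are invented.

```python
from collections import Counter
import re

def keyword_score(query, text):
    """Count raw occurrences of the query words in the text, ignoring context."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return sum(counts[q] for q in query.lower().split())

honest_page = "Our restaurant serves fresh pasta every day."

# Hypothetical stuffed page: the same visible text plus words a reader never
# sees (for example white text on a white background).
hidden_words = " pasta" * 20
stuffed_page = honest_page + hidden_words

print(keyword_score("pasta", honest_page))   # 1
print(keyword_score("pasta", stuffed_page))  # 21
```

A purely frequency-based score cannot tell the two pages apart in any way other than the count, so the stuffed page wins.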
Now, in our language we all use a very interesting abstraction hierarchy. Consider the following list:
Mercedes-Benz SL 500
This kind of hierarchy tells me that if a person is driving an SL 500, he is a rich person. I went through the hierarchy to come to this conclusion. We always use such hierarchies in understanding documents. A word-based system lacks this and cannot come even partially close to such conclusions. A word-based system is completely inadequate when understanding is required.
Of course, these systems are very efficient if you are interested in finding a restaurant by its name, or headlines in newspapers (or research papers) based on certain keywords. So these systems are effective in many applications. We all love search engines because they do well in such cases and because, at this time, there is nothing else for more complex cases.
But the fact remains that search approaches based on the syntactic and statistical characteristics of word occurrences in documents are fundamentally limited. They work very well for certain applications, but fall apart and cause frustration when you try to use them for anything beyond those applications. And that is not a problem of the search engines, but of using them where they are not applicable. We don’t use teaspoons to fill a truck carrying dirt for our garden. We also don’t use the size of a person’s forehead to judge how brilliant the person is.