From the research literature, the trade literature, and even the popular press, it is easy to get the impression that semantics will be a non-issue because we are now going to attach metadata to all forms of data. The current belief seems to be that once metadata is available, the semantics problem is solved. Wouldn't it be nice if this were even close to the truth?
So what is metadata, and how does it solve the semantics problem? Metadata, simply put, is data about data: information about how data should be interpreted. A bare number could be a salary, a street address, an age, a social security number, or any other attribute of a person or some other entity. Once we are told that it is an annual salary, we interpret it very differently than if we were told that it represents a street address. In most cases, human beings are very good at using context to interpret a number. If I am told that John makes 121,750.00, then I know that this number represents his salary. On the other hand, if I see
John 858 295 3281
then I immediately know that this number represents his telephone number. Humans are usually quite good at understanding what a number means, or, in computer jargon, the semantics of the number; algorithms that infer this context from text or data, however, have proven surprisingly difficult to develop. When researchers found that developing algorithms to understand text was going to take a significant amount of time, and that this was becoming a true bottleneck in emerging applications, they turned to metadata. The idea was simple and very innovative: for each important data item that needs to be interpreted, use a tag that says what the data is. Thus
<salary> 121750.00 </salary>
<telephone> 858 295 3281 </telephone>
will be used to represent the semantics of these two numbers explicitly through the tags. Notice two important things about this notation (from XML): the tags (the metadata) are inserted immediately before and after the data, and they are human-readable. There are many reasons for this design and many interesting implications, but we will not go into the details here.
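As a minimal sketch of why such tags matter to a program, the two tagged values above can be read back with Python's standard `xml.etree.ElementTree` module (the enclosing `person` element and `name` tag are illustrative assumptions, not part of the original snippet); the tag names are what let the program choose the right interpretation for each number:

```python
import xml.etree.ElementTree as ET

# A small record combining the two tagged values from the text.
# The <person> wrapper and <name> tag are assumed for illustration.
record = """
<person>
  <name>John</name>
  <salary>121750.00</salary>
  <telephone>858 295 3281</telephone>
</person>
"""

root = ET.fromstring(record)

# Without the tags, "121750.00" and "858 295 3281" are just strings.
# The metadata tells the program that one is a quantity to compute with
# and the other is an identifier to keep verbatim.
salary = float(root.findtext("salary"))   # interpret as a number
telephone = root.findtext("telephone")    # keep as an opaque string

print(salary)     # 121750.0
print(telephone)  # 858 295 3281
```

The point is that the interpretation is driven entirely by the tag, not by any property of the number itself; swapping the two values under the wrong tags would silently produce nonsense, which is exactly the semantics problem the metadata is meant to prevent.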
Most techniques used in current search engines were developed to infer semantics from the distribution of words. Since these techniques are mostly syntactic and can capture only very limited semantics, people have been trying to improve search results by using ontological filtering. The presence of metadata helps significantly in applying such approaches to refine semantics, which is why metadata is being used in more and more applications. In fact, the whole notion of the Semantic Web championed by Tim Berners-Lee is based on using this metadata extensively. Many Semantic Web applications are being developed, and the concept of web services is becoming increasingly popular. All of this is a result of the use of metadata.
It is clear that we still have serious progress to make before we understand the semantics of documents as well as we would like. Metadata provides semantics for only a limited amount of data, and only in a local sense. Algorithms still need to be developed to obtain a clear understanding of a document as a whole.
A very interesting aspect of the Web that people are not yet taking as seriously as they should is the rapidly changing nature of its data. In 1994, most of the data on the Web, if not all of it, consisted of text and databases containing only alphanumeric data with strong semantics. Today I would guess that about 15-20% of the data is images, video, audio, and other sensory data. If we project to the year 2009, at least 50% of the data will be non-textual. This percentage will keep increasing until text is a very small, though still very important, fraction of the information.
What are the implications of this changing nature of information for finding information on the Web? One obvious consequence is that people will become interested in information independent of the medium. People will not care whether the information comes from video, text, or audio; they will only want the information. That means search engines will have to combine text, audio, images, video, and other sources to find correlated data that together provide the information. The importance of the medium will disappear entirely; only the message will matter.
Can metadata be used effectively in this setting? What form will that metadata take? How will it be used?