Let's assume that computing resources are not the limitation. Remember the famous saying that a picture is worth a thousand words. So let's apply our computing resources to identify which one thousand words from our lexicon are most relevant to a given picture. Maybe we can even adopt a probabilistic framework and identify the N words from our lexicon that are probabilistically significant for the picture. We can then use these words to represent each image.
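As a rough illustration of the "N probabilistically significant words" idea, here is a minimal sketch in Python. It assumes some recognizer has already produced a probability for each lexicon word given the picture; the function name `top_words`, the `min_prob` cutoff, and the sample scores are all hypothetical and only there to make the idea concrete.

```python
def top_words(scores, n=5, min_prob=0.1):
    """Return up to n (word, probability) pairs, most probable first.

    scores   -- dict mapping a lexicon word to an assumed probability
                that the word is relevant to the picture
    min_prob -- hypothetical cutoff below which a word is not
                considered significant
    """
    kept = [(word, p) for word, p in scores.items() if p >= min_prob]
    # Sort by descending probability; break ties alphabetically so the
    # result is deterministic.
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    return kept[:n]


# Made-up scores for a single picture, for illustration only.
scores = {"beach": 0.91, "sunset": 0.78, "dog": 0.40, "car": 0.03, "snow": 0.01}
representation = top_words(scores, n=3)
```

The resulting list of words, with or without the probabilities, is what would stand in for the picture in everything that follows.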
This will allow us to use the powerful framework that has been developed for information retrieval, beyond the detection of words. We can still use the inverted file structure, and we can use tools such as ontologies, which help us apply domain knowledge, or even WordNet, which helps in using hierarchies and relationships among words. More importantly, this allows us to treat recognizing words in images as a separate process. That means the image-word recognition problem can be addressed using machinery from image processing, computer vision, and other application-related areas. In fact, in the early stages, this could even be a manual or semi-automatic process. In developing manual or semi-automatic approaches, however, careful consideration should be given to the interface between the process for assigning words and the process for using them. The more independent the two processes are, the better the complete system will be.
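To make the separation between assigning words and using words concrete, here is a small sketch of an inverted file over images. It assumes the only thing passed between the two processes is a mapping from an image identifier to its assigned words; how those words were produced (manually, semi-automatically, or by a vision system) is invisible to the index. The names `build_inverted_index` and `search`, and the sample data, are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(image_words):
    """Map each word to the set of image ids it was assigned to.

    image_words -- dict of image id -> iterable of words; this mapping
                   is the assumed interface between the two processes
    """
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word.lower()].add(image_id)
    return index


def search(index, query_words):
    """Return the sorted ids of images that carry every query word."""
    postings = [index.get(word.lower(), set()) for word in query_words]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


# Made-up word assignments, for illustration only.
image_words = {
    "img1": ["beach", "sunset"],
    "img2": ["beach", "dog"],
    "img3": ["snow", "dog"],
}
index = build_inverted_index(image_words)
matches = search(index, ["beach", "dog"])
```

Because the index only ever sees words, the word-assignment step can be replaced or improved without touching the retrieval side, which is the independence argued for above.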
A by-product of this approach may be the creation and addition of new words to the lexicon. These new words may be meaningful only for images (and video or similar media) and may be marked as such. This is not completely new; it has happened in many specialized application areas. For example, fingerprint analysts have their own special words that are not part of the common lexicon but are definitely part of the extended lexicon. I think it is important to develop such words because they will enrich the language and will help bring images, video, audio, and other knowledge modalities under the umbrella of the search framework that has established its utility so well in the text domain.