Earlier this week I visited Prof. JiangPing Fan’s group at the University of North Carolina, Charlotte. He has been developing image segmentation and classification approaches for annotating images. (Disclosure: I am collaborating with him on this research, but my role is small; he is the lead.) Some information and experiences from this trip are available here.
Prof. Fan has developed a machine learning approach that identifies words like grass, sand, mountain, sky, water, and flower in images based on visual characteristics such as color and texture. But his technique goes well beyond this: it can also detect ‘concepts’ like beach, garden, city, and mountain scene. This work is described in technical detail in a paper to appear at the ACM Multimedia Conference in Santa Barbara this October. A preprint of the paper is available.
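To make the basic idea concrete, here is a toy sketch of classifying an image region into one of a few basic labels by its color and texture features, using a nearest-prototype rule. The feature choices and prototype values below are purely illustrative assumptions of mine, not taken from Prof. Fan’s paper, which uses a far more sophisticated method.

```python
import math

# Each label gets a hand-picked prototype feature vector:
# (mean R, mean G, mean B, texture "roughness" on a 0-1 scale).
# These values are rough illustrative guesses, not learned parameters.
PROTOTYPES = {
    "grass": (0.25, 0.60, 0.20, 0.55),
    "sand":  (0.80, 0.70, 0.50, 0.35),
    "sky":   (0.45, 0.60, 0.90, 0.05),
    "water": (0.25, 0.45, 0.70, 0.20),
}

def classify_region(features):
    """Return the label whose prototype is nearest in Euclidean distance."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(PROTOTYPES, key=lambda label: dist(features, PROTOTYPES[label]))

# A region that is mostly blue and very smooth should come out as sky.
print(classify_region((0.50, 0.62, 0.88, 0.08)))  # prints "sky"
```

In a real system the features would be computed from segmented image regions, and region labels would then be combined to infer scene-level concepts like beach or garden.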
If today one can be reasonably successful at automatically identifying some basic visual objects and a few visual concepts, then by scaling up to a few hundred visual objects and concepts it should become possible to automatically annotate photos with important tags.
Now why am I more optimistic about this approach than about earlier approaches from the computer vision community? Because we can start using EXIF data, the metadata associated with images, to significantly refine the annotation process. I see that computer vision approaches may finally start using other sources of information to identify visual objects. Moreover, these tags are just tags, and in a search-engine-like environment people are willing to tolerate significant inaccuracies.
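As a rough illustration of how EXIF metadata could refine vision-based tags, here is a sketch that adjusts per-tag confidence scores using camera settings. The specific rules (flash, exposure time) and thresholds are my own hypothetical assumptions for illustration, not part of Prof. Fan’s method.

```python
def refine_tags(tag_scores, exif):
    """Adjust per-tag confidence scores using EXIF hints.

    tag_scores: dict mapping tag -> confidence in [0, 1] from a
    visual classifier. exif: dict of already-parsed EXIF fields.
    The rules below are illustrative assumptions only.
    """
    scores = dict(tag_scores)
    # A fired flash suggests an indoor or night shot, so bright
    # outdoor tags become less plausible.
    if exif.get("Flash"):
        for tag in ("beach", "mountain scene", "sky"):
            if tag in scores:
                scores[tag] *= 0.5
    # A very short exposure usually means bright daylight, which
    # makes outdoor scenes more plausible.
    if exif.get("ExposureTime", 1.0) < 1 / 500:
        for tag in ("beach", "sky"):
            if tag in scores:
                scores[tag] = min(1.0, scores[tag] * 1.3)
    return scores

# A daylight shot (no flash, 1/1000s exposure) boosts outdoor tags.
visual = {"beach": 0.6, "indoor": 0.4, "sky": 0.7}
print(refine_tags(visual, {"Flash": False, "ExposureTime": 1 / 1000}))
```

The point is only that metadata the camera already records for free can shift the priors of the visual classifier, which is exactly the kind of extra information source the field has historically ignored.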
It would be great to scale these approaches up to, say, 500 visual objects and 200 visual concepts. In fact, even one-tenth of that would make them powerful enough to revolutionize how we organize and access photos.