The more I think about indexing multimedia data, the more I feel that the correct approach may be Context-Audio-Visual (CAV) based indexing rather than the current keyword-based and feature-based approaches.
Context captures significant information about the environment in which the data was collected. This information may come from many sources, ranging from metadata recorded by the devices collecting the data to the person collecting it, or even other people, for example in the form of tags.
Audio and visual indexing could build on some traditional low-level feature-based approaches, but with a fresh look at how the indexing is done. Also, one has to treat low-level features as just that, low-level features, and has to develop ways to map the high-level information expressed by users onto those low-level features. A serious problem in this area appears to be the popularity of machine learning. These approaches do not allow indexing, and in the current setting it may be difficult to convert from high level to low level.
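To make the idea a little more concrete, here is a minimal sketch of what a CAV record and lookup might look like. Everything in it is an assumption made for illustration: the names `MediaItem`, `CAVIndex`, and `HIGH_TO_LOW`, the feature names `brightness` and `loudness`, and the numeric ranges are all hypothetical. Context is held in a simple inverted index, while the high-level terms are mapped by hand onto ranges of low-level features. The feature filter here is a plain scan over candidates, so it only illustrates the mapping, not a real feature index.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class MediaItem:
    item_id: str
    context: dict  # e.g. {"device": "phone", "tag": "beach"}
    audio: dict    # low-level audio features, e.g. {"loudness": 0.2}
    visual: dict   # low-level visual features, e.g. {"brightness": 0.9}


# Hypothetical, hand-written mapping from high-level user terms to
# (modality, low-level feature, minimum, maximum). The values are made up.
HIGH_TO_LOW = {
    "bright": ("visual", "brightness", 0.7, 1.0),
    "quiet": ("audio", "loudness", 0.0, 0.3),
}


class CAVIndex:
    def __init__(self):
        self.items = {}
        # Inverted index over context: (key, value) -> set of item ids.
        self.context_index = defaultdict(set)

    def add(self, item):
        self.items[item.item_id] = item
        for key, value in item.context.items():
            self.context_index[(key, value)].add(item.item_id)

    def search(self, context=None, terms=()):
        # Step 1: narrow the candidates using the context index.
        if context:
            sets = [self.context_index.get(pair, set()) for pair in context.items()]
            candidates = set.intersection(*sets)
        else:
            candidates = set(self.items)
        # Step 2: translate each high-level term into a low-level range
        # and keep only the candidates whose feature falls inside it.
        for term in terms:
            modality, feature, low, high = HIGH_TO_LOW[term]
            candidates = {
                i for i in candidates
                if low <= getattr(self.items[i], modality).get(feature, float("nan")) <= high
            }
        return sorted(candidates)
```

A query such as `index.search(context={"tag": "beach"}, terms=["bright", "quiet"])` first uses the context to cut down the candidate set and only then looks at the audio and visual features, which is the ordering the CAV idea suggests.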
In any case, it is clear that a fresh look at this problem is really required.