In these days of Increasingly Bigger Data, it is becoming very important to understand the big picture that can be derived from that data. There are two related approaches one could use to provide this big picture: summarization and storytelling. The two are related, but not the same. In summarization, the goal is to look at the data that has been collected and prepare a summary that represents all of the data meaningfully and in a way that one can explore. A commonly used structured representation of data is an index. An index provides pointers to important locations in the data; it is just an index, but it does give an idea of what the data contains. An index is commonly prepared by somebody who knows the data, and this is the source of both the strengths and the limitations of indexes. In these days of increasingly bigger data, expecting somebody to analyze and index the data in every area may create a bottleneck for applications of that data. Also, when a person prepares an index, that person's knowledge, biases, and perspective shape the index, and the index then remains relatively fixed.
Summarization of data or text represents the data in a manageable-size summary that captures the essence of the data for a specific application context.
An automated summarization technique can analyze incoming data and prepare summaries as the data becomes available. It can also be tuned for specific applications, so that multiple summaries can be developed from the same data for different applications. In some cases, these summaries may even be used as an index into a larger set of data.
Pinaki Sinha (a doctoral student who worked under my supervision) developed a summarization approach and an algorithm for large collections of photos. Given a large collection, say a few thousand photos, the approach summarizes it by selecting a small number of photos, say between 10 and 20. It considers three important parameters: quality, coverage, and diversity. Each photo in the original collection is assigned a quality measure. The algorithm then uses all metadata available for the photos to cover as many events as possible, selects for maximum diversity among events, and represents each event by its best representative photo. This is a very simple description of the approach and does not do justice to its full power, but a complete treatment is not our goal here.
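The flavor of such an approach can be sketched in a few lines of code. This is not Sinha's actual algorithm, only an illustrative greedy sketch under simplifying assumptions: each photo carries a hypothetical precomputed `quality` score and an `event` label derived from its metadata, coverage and diversity are approximated by picking the best representative of each distinct event first, and remaining slots are filled by overall quality.

```python
from dataclasses import dataclass

@dataclass
class Photo:
    id: str
    event: str      # event label derived from metadata (e.g., time/location)
    quality: float  # assumed precomputed quality score in [0, 1]

def summarize(photos, k):
    """Greedy sketch: select up to k photos, covering as many distinct
    events as possible and representing each event by its best photo."""
    # Best photo per event approximates coverage and diversity.
    best_per_event = {}
    for p in photos:
        if p.event not in best_per_event or p.quality > best_per_event[p.event].quality:
            best_per_event[p.event] = p
    summary = sorted(best_per_event.values(),
                     key=lambda p: p.quality, reverse=True)[:k]
    if len(summary) < k:
        # More slots than events: fill with the next-best photos overall.
        chosen = {p.id for p in summary}
        rest = sorted((p for p in photos if p.id not in chosen),
                      key=lambda p: p.quality, reverse=True)
        summary += rest[:k - len(summary)]
    return summary
```

For example, given four photos spanning three events, `summarize(photos, 2)` would return the two highest-quality event representatives. A real system would score quality from image features, cluster events from timestamps and locations, and trade off the three parameters jointly rather than greedily.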
Summarization algorithms like the one discussed above for photos need to be designed and developed for large collections of data. Of course, once such summary data is available, one must also consider how to render it in an interesting way.