Summarization



 

Introduction

Faced with a vast overload of potentially relevant information, readers need help making their way through it. Language technology offers some solutions: it can cut down the bulk of longer texts while preserving, in understandable form, their major points and perhaps references to points of specific interest.

Surprisingly, the most practical techniques for general summaries do not try to build up an "understanding" of texts at all. Rather, they adopt the Salience approach, i.e. they use word frequency counts to identify and preserve the most significant sentences, measured against the background of words used in the rest of the article. This has the advantage of being language-independent as a technique and, because whole sentences are carried over unchanged, of guaranteeing that the results are at least grammatical.
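The frequency-based idea can be illustrated with a short Python sketch. It is only a minimal illustration under stated assumptions, not the method of any product named on this page: the function names, the tiny stop-word list and the scoring rule (average article-wide frequency of a sentence's words) are all chosen here for the example. The stop-word list is the only language-specific part and could be swapped or dropped.

```python
import re
from collections import Counter

# Assumed, deliberately tiny stop-word list (illustrative only).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "are", "it"}

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    """Lower-case words, minus stop words."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return [w for w in words if w not in STOPWORDS]

def summarize(text, max_sentences=2):
    """Keep the sentences whose words are most frequent in the whole text."""
    sentences = split_sentences(text)
    # Background counts: how often each word occurs across the article.
    freq = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        words = tokenize(sentence)
        # Average frequency, so long sentences are not favoured automatically.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]),
                    reverse=True)
    # Restore original order so the summary reads naturally.
    keep = sorted(ranked[:max_sentences])
    return [sentences[i] for i in keep]
```

Because the output consists of sentences copied whole from the input, it stays grammatical whatever the scoring does; the quality of the selection depends entirely on how well word frequency tracks importance in the text at hand.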

Another important application is to build up a cumulative record of interesting patterns of events that may be reported in a vast volume of free text. This is called Information Extraction: it might for example work through a set of casualty reports to create a database, from which an analysis of trends or underlying causation might emerge. Here, the best techniques employ a kind of template, looking to fit the activities described into a given pattern: so it makes sense to see the systems as having a rudimentary "understanding", distinguishing e.g. agents from victims or locations.
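As a rough illustration of template fitting, the Python sketch below matches short incident reports against a single pattern and stores the pieces as database-style records. Everything in it is an assumption made for the example: the `Incident` record, the three action verbs, the regular expression and the sample sentences are hypothetical, and real extraction systems use far richer linguistic analysis than one regular expression.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Incident:
    """One filled template: who did what to whom, and where."""
    agent: str
    action: str
    victim: str
    location: str

# Hypothetical template: "<Agent> <verb> <victim> in/at/near <Location>".
TEMPLATE = re.compile(
    r"(?P<agent>[A-Z][\w ]*?) "
    r"(?P<action>injured|struck|hit) "
    r"(?P<victim>[\w ]+?) "
    r"(?:in|at|near) "
    r"(?P<location>[A-Z][\w ]*)"
)

def extract(report: str) -> Optional[Incident]:
    """Return a filled template, or None if the report does not fit it."""
    match = TEMPLATE.search(report)
    if match is None:
        return None
    return Incident(**match.groupdict())

def build_database(reports: List[str]) -> List[Incident]:
    """Collect every report that fits the template; skip the rest."""
    records = []
    for report in reports:
        incident = extract(report)
        if incident is not None:
            records.append(incident)
    return records
```

Given reports such as "A lorry struck a cyclist near Bath." the sketch separates agent, victim and location into named fields, which is the rudimentary sense of "understanding" described above. Text that does not fit the pattern is simply ignored, so coverage is only as good as the template.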

 

Where the Progress is Being Made

At the University of Sheffield, various approaches are being explored to improve the efficacy of information extraction: in ECRAN, for example, they have developed an approach based on Galois lattices to recognize new contexts, words and senses.

 

Sources for Products

This technology is already built into some widely used word processors, e.g. AutoSummarize in Word 97. Apple's Information Access Toolkit (once known as V-Twin) is also available for developers.

To note three proprietary suppliers among many:

  • Glucose Development Corporation's Data Hammer combines the Salience approach with the generation of headline-like summary titles.

  • Cognos prides itself on the generation of reports, customizable to particular classes of user.

  • Lernout & Hauspie offers the Intelliscope Retrieval Toolkit. This includes summarization which can be biased towards the specified interests of particular users. It incorporates a small amount of grammatical analysis: identifying noun phrases, stripping word endings and identifying the likely reference of pronouns.

 

Things to Watch Out for

  • Document management is increasingly becoming an integrated activity, with summarization provided as just one facility amidst a batch of others: word processing, search, translation, multimedia graphics and publishing.

  • A summary that aims to capture the original author's main point will typically look quite different from notes or headlines biased towards the interests of particular readers or users.

 

If you'd like to learn more about the potential of this technology from an experienced but completely impartial source, it's time you got in touch with Linguacubun Ltd.



Linguacubun Ltd. Batheaston Villa, Bailbrook Lane, Bath BA1 7AA UK Tel:+44(0)1225 852865 Fax: +44(0)1225 859258