Information Search



 

Introduction

This is where natural language access, by speech or writing, really comes into its own. Readable information is overwhelmingly stored as text, in databases and libraries all over the world, and it takes only a few well-chosen words to find the text that is most helpful, whether in one's own files or in some far corner of the World Wide Web.

The key to searching is to choose words which are not too ambiguous. It is now also possible to extend searches through automatic selection of synonyms, or through translations and paraphrases into other languages.
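The idea of extending a search through synonyms can be illustrated with a minimal sketch. It assumes a small hand-written synonym table; the names `SYNONYMS`, `expand_query` and `search` are hypothetical and are not taken from any of the products or projects described here. A real system would draw on a full thesaurus or wordnet.

```python
# Hypothetical synonym table, for illustration only.
SYNONYMS = {
    "car": ["automobile", "motorcar"],
    "film": ["movie"],
}


def expand_query(query, synonyms=SYNONYMS):
    """Return the query terms plus any listed synonyms, without duplicates."""
    expanded = []
    for term in query.lower().split():
        for candidate in [term] + synonyms.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


def search(documents, query):
    """Return the indices of documents containing at least one expanded term."""
    terms = set(expand_query(query))
    return [
        index
        for index, text in enumerate(documents)
        if terms & set(text.lower().split())
    ]
```

With this sketch, a search for "car" would also find a document that mentions only "automobile". The same approach could in principle be applied to translations, by listing equivalents in other languages in the table.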

Where the Progress is Being Made

Mulinex.   This European project, based at the German Research Center for Artificial Intelligence, fully implemented an approach to multilingual web search, with input possible in English, French and German, and the documents found summarised and translated as necessary into all three languages. Further work is now concerned with adding other languages, adding a personal agent system for registered users, and clustering the search results effectively. Bridges are also being built to extract given types of information from the documents found.

Cambridge University.   Cambridge University's Engineering Lab developed the Video Mail Retrieval project, which showed the feasibility of using word-spotting in audio soundtracks to search video databases. They also showed that there were spin-off benefits in the way that keywords could be used to index large-scale broadcast archives of text. Current work on the Multimedia Document Retrieval project aims to transcribe and index audio and video material automatically, as well as to integrate this into a probabilistic information retrieval model.

Xerox The Document Company.   This firm is engaged in a Knowledge Brokers project which aims at giving concurrent access to various types of data, notably a combination of the unstructured kind found on the World Wide Web, and the contents of organized and formatted databases.

 

Sources for Products

Bull - Searchway/Mistral   This information server can be combined with the Alis Tango multilingual browser.

Claritech - Natural Language Tools for Retrieving and Managing Information   This company, a spin-off from Carnegie Mellon University in Pittsburgh, works in combination with Just Systems, the biggest PC software company in Japan: ConceptBase 20/1000, their joint product, is a natural language information retrieval tool. (It is said that IBM has plans to upgrade this product with speech recognition input.) There is competition in Japan from the traditional electronics majors: Fujitsu offers Full Search Shunsaku V 1, for large text bases; and NEC offers JTOPIC Family, which can also work over a network, and extend its search targets using synonym dictionaries.

Autonomy - Knowledge Management Suite   This is a complex of many knowledge search and management tools, which works multilingually (in 11 languages, including Thai, Arabic, Japanese and Korean, as well as the major European languages). It incorporates summarization and indexing. It is based not on keywords but on Bayesian methods applied to Neural Networks, in other words probability-based inference from correlations between documents which are calculated by the system itself. As such, its search methods are fundamentally akin to those used for signal processing (e.g. speech).
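To give a flavour of what probability-based inference over documents can mean, here is a minimal sketch of a naive Bayes text classifier. It is not Autonomy's method, whose details are not described here, and it does not involve neural networks; it is a much simpler, well-known technique swapped in for illustration. The function names `train` and `classify` and the sample data in the usage note are assumptions made for this example.

```python
import math
from collections import Counter


def train(examples):
    """Count words per label from (text, label) pairs.

    Returns (word_counts, doc_counts): a Counter of words for each label,
    and a Counter of how many example documents carry each label.
    """
    word_counts = {}
    doc_counts = Counter()
    for text, label in examples:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, doc_counts


def classify(text, word_counts, doc_counts):
    """Return the label with the highest posterior log-probability.

    Uses Bayes' rule with add-one smoothing, so that words unseen for a
    label do not reduce its probability to zero.
    """
    vocabulary = {word for counts in word_counts.values() for word in counts}
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        score = math.log(doc_counts[label] / total_docs)  # prior
        denominator = sum(counts.values()) + len(vocabulary)
        for word in text.lower().split():
            score += math.log((counts[word] + 1) / denominator)  # likelihood
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

As a usage example, one might train on a few labelled texts such as ("speech recognition audio signal", "speech") and ("translation bilingual dictionary language", "translation"), and then call `classify("audio signal", word_counts, doc_counts)`. The point of the sketch is that the system learns word statistics from the documents themselves, rather than relying on keywords chosen by the user.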

 

Things to Watch Out for

  • Eurowordnet   Information retrieval, even in its most modern applications, remains dominated by keyword search, whether on titles, abstracts, or full text; no-one has yet found a means of harnessing linguistic knowledge on how the meanings of words are interrelated, although there are increasingly sophisticated ways for users to search through webs of synonyms before selecting the keywords to use.

  • Nua   The volume of material on the web in languages other than English is due to overtake that in English in 2001. Multilingual search will therefore increasingly become an essential part of this technology, both for non-English-speaking users and for everyone seeking most of what the World Wide Web has to offer.

  • Different search strategies will be needed to produce the best results for Intranet search (where the format and likely content of databases may be known) as against the World Wide Web, where only data formatting standards are known in advance.

 

If you'd like to learn more about the potential of this technology, from an experienced but completely impartial source, it's time you got in touch with Linguacubun Ltd itself.



Linguacubun Ltd. Batheaston Villa, Bailbrook Lane, Bath BA1 7AA UK Tel:+44(0)1225 852865 Fax: +44(0)1225 859258