An Overview of Information Retrieval (IR)

Information Retrieval (more precisely termed Full-Text Information Retrieval, in terms of its practical applications) is a field that runs to some extent parallel to database technology. It concerns itself with processing collections of documents that have no obvious structure (or where any structure that is imposed is artificial). Examples of such collections are a set of hospital discharge summaries or surgery notes, or the full text of the complete works of Shakespeare or the Bible.

For years an "orphan" technology researched by a relative handful of scientists, IR has recently moved into the mainstream with the advent of Web search engines such as Google, Yahoo, Excite and Alta Vista. For example, both the Netscape and Microsoft Web servers come with indexing engines (Microsoft's is called Microsoft Index Server). More important, all mainstream relational database engines now have IR add-ons as well: Oracle, MS SQL Server, IBM's DB/2 and Sybase. In terms of features, however, the IR offerings of RDBMSs currently lag behind their older stand-alone counterparts. The gap is narrowing: in some respects (notably support for document collections in multiple languages) Oracle and DB/2 are actually more advanced than many commercial "pure IR" systems.

There are several differences between searching a collection of semi-structured documents and searching a relational database table.

Pre-processing a collection of documents for efficient searching

Even with the power of today's supercomputers, if a huge collection of documents had to be scanned linearly for every occurrence of a word, the search process would take several hours. Search engines need to return results within a minute or so. To achieve this speed, they create an index in advance. The process works roughly as follows:
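In broad strokes, the engine tokenizes each document, drops very common "stop words" (such as "the" or "of"), and records, for each remaining term, the set of documents in which it occurs. This structure is called an inverted index, and it is what makes keyword lookup fast. The sketch below is a minimal, in-memory illustration of this standard idea; the toy collection, the function names and the tiny stop word list are illustrative assumptions, not any particular engine's implementation.

from collections import defaultdict
import re

# A tiny, illustrative stop word list; real engines use much larger ones.
STOP_WORDS = {"the", "a", "an", "of", "and", "or", "in", "to", "was", "with"}

def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_inverted_index(documents):
    """Map each term to the set of document IDs in which it occurs."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            if term not in STOP_WORDS:
                index[term].add(doc_id)
    return index

# Toy collection: two imaginary discharge summaries.
docs = {
    1: "Patient admitted with chest pain; aspirin was given.",
    2: "Discharge summary: ulcer noted, aspirin discontinued.",
}
index = build_inverted_index(docs)
print(index["aspirin"])   # {1, 2} -- both documents mention aspirin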

Types of Query Methods

There are two methods of searching such indexes. The conceptually simpler (and older) Boolean method allows the user to specify keywords combined with the operators AND, OR and NOT, in a manner familiar to users of Medline. Boolean search methods have been criticized on several grounds, notably that they return matching documents in no particular order of relevance.
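Mechanically, a Boolean query is evaluated with set operations on an inverted index of the kind sketched above: AND corresponds to set intersection, OR to union, and NOT to set difference. A minimal sketch with a hand-built toy index (the layout, query and names are illustrative, not any particular engine's internals):

# A tiny hand-built inverted index (term -> set of document IDs).
index = {
    "aspirin":  {1, 2},
    "ulcer":    {2},
    "bleeding": {3},
}

def boolean_query(index):
    """Evaluate the query: aspirin AND (ulcer OR bleeding)."""
    aspirin  = index.get("aspirin", set())
    ulcer    = index.get("ulcer", set())
    bleeding = index.get("bleeding", set())
    # AND maps to set intersection, OR to union; NOT would be set difference.
    return aspirin & (ulcer | bleeding)

print(boolean_query(index))   # {2}: only document 2 satisfies the query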

Document Vector Approach

To address the issue of relevance ranking, the Document Vector approach was devised by Gerard Salton (who is to the field of IR what Einstein and Newton were to physics in their days). In this approach, each document is represented as a vector in N-dimensional space, where N is the number of distinct terms in the document collection. (So, if we have a thousand terms, we are talking about a thousand dimensions. However, the actual math turns out to be very simple.) Here, the user simply types in a list of keywords of interest, and the search engine retrieves matches, ranking them in descending order of relevance. We now describe how this works.
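In outline, each component of a document's vector records how heavily the corresponding term features in that document; the user's query is treated as a (very short) document vector of its own, and documents are ranked by how closely their vectors point in the same direction as the query vector. Below is a minimal sketch of this idea, using raw term counts and cosine similarity; real engines weight the counts (for example by the IDF * TF scheme discussed below), and the names here are illustrative assumptions.

import math
from collections import Counter

def to_vector(text):
    """Represent a text as a term-count vector (a Counter)."""
    return Counter(text.lower().split())

def cosine_similarity(v1, v2):
    """Cosine of the angle between two term-count vectors."""
    dot   = sum(v1[t] * v2[t] for t in v1)
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

docs = {
    1: "aspirin given for chest pain",
    2: "ulcer noted aspirin stopped",
    3: "routine follow up visit",
}
query = to_vector("aspirin ulcer")

# Rank documents in descending order of similarity to the query.
ranked = sorted(docs,
                key=lambda d: cosine_similarity(query, to_vector(docs[d])),
                reverse=True)
print(ranked)   # [2, 1, 3] -- document 2 contains both query terms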

Modern search engines use a combination of Boolean and vector methods. Vector-based searching alone returns far too many hits, as anyone who has ever used Alta Vista has painfully discovered, so Boolean methods are needed to restrict the search. Most Web search engines accept the character + before a word (with no intervening space) to mandate that all retrieved documents contain that word. Similarly, the character - before a word indicates that the retrieved documents should not contain that word. (The best-known example of this, given in a New York Times article, is when you want to research Jordan, the country, and you don't want a zillion basketball references coming up. To do this, you type +Jordan -Michael as your search query.) The query engine uses Boolean search to restrict the number of hits, and vector processing to rank the hits by relevance.
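One plausible way to sketch this pipeline: the + and - prefixes are first applied as Boolean filters over the index, and the surviving documents are then ranked by the vector method shown above. The parsing and index layout below are illustrative assumptions, not a description of how any specific engine is implemented.

def parse_query(query):
    """Split a query like '+Jordan -Michael travel' into required,
    excluded, and ordinary terms."""
    required, excluded, plain = set(), set(), set()
    for token in query.lower().split():
        if token.startswith("+"):
            required.add(token[1:])
        elif token.startswith("-"):
            excluded.add(token[1:])
        else:
            plain.add(token)
    return required, excluded, plain

def boolean_filter(index, all_docs, required, excluded):
    """Keep only documents containing every required term and no excluded term."""
    surviving = set(all_docs)
    for term in required:
        surviving &= index.get(term, set())
    for term in excluded:
        surviving -= index.get(term, set())
    return surviving

# Toy index: term -> set of document IDs.
index = {
    "jordan":  {1, 2, 3},
    "michael": {1},
    "amman":   {2},
}
required, excluded, plain = parse_query("+Jordan -Michael")
print(boolean_filter(index, {1, 2, 3, 4}, required, excluded))   # {2, 3}
# The surviving documents would then be ranked by vector similarity.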

The IDF * TF approach is rather simplistic, and many commercial engines use refinements. For example, some also use a proprietary weighting scheme based on how close the individual terms of the query are to one another within each document (the closer they are, the higher the weight). Also, the IDF * TF approach is readily fooled by Web pages that repeat the same word or phrase (which may be an obscene one) literally hundreds of times, in meta-tags that are visible to the search engine but not to the end-user. The intention of repeating the phrase is to make such pages rank highly when a user searches for documents containing it. Engines such as Google are much harder to fool: they rank each document not only on an IDF * TF score but also on how many other pages refer to it. (The other pages, so to speak, "vote" for the pages they have found most useful.) The voting pages are in turn voted for by other pages regarding their own utility. (In general, if you want to make your Web pages rank high on Google, you are better off adding good content that other people then bookmark or hyperlink to.)
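The "voting" idea can be made concrete with a simplified calculation in which each page's score is repeatedly recomputed from the scores of the pages that link to it, so a vote from a highly scored page counts for more than one from an obscure page. The sketch below illustrates this recursion in its barest form (roughly the published PageRank idea); it is not a description of Google's actual, proprietary ranking.

def link_votes(links, iterations=20, damping=0.85):
    """Iteratively score pages from the scores of the pages that link to them.

    `links` maps each page to the list of pages it links to ("votes for").
    """
    pages = set(links) | {p for targets in links.values() for p in targets}
    score = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                # Each page splits its current score among the pages it votes for.
                share = damping * score[page] / len(outlinks)
                for target in outlinks:
                    new[target] += share
        score = new
    return score

# Toy web: pages A and C both link to B, so B ends up with the highest score.
toy_web = {"A": ["B"], "B": ["C"], "C": ["B"]}
print(link_votes(toy_web))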

The difference between the "vanilla" search engines and the fancier ones is that the latter offer many more choices in the indexing and search process. They understand documents in their native formats (e.g., HTML, MS Word, WordPerfect), let you add words to the stop word list, and offer proximity searching as an option.

Historical Sidelines on Vector-based Methods

Just as we can compare a query vector with a document vector, we can also compare document vectors with each other. The document vector to some extent acts like a fingerprint for the author, because particular authors tend to use some words more often than others. This knowledge has been used in various ways.