An Overview of Information Retrieval (IR)
Information Retrieval (more precisely termed Full-Text Information Retrieval,
if we consider its practical applications) is a field that is to some extent
parallel to database technology. It concerns itself with the processing of
collections of documents which have no obvious structure (or where any structure
that is imposed is artificial). Examples of such collections are a set of
hospital discharge summaries or surgery notes, or the full text of the complete
works of Shakespeare or the Bible.
For years a relatively "orphan" technology researched by a relative handful of scientists, IR has recently arrived in the mainstream with the advent of Web search engines such as Google, Yahoo, Excite and AltaVista. For example, both the Netscape and Microsoft Web servers come with indexing engines (Microsoft's is called Microsoft Index Server). More important, all mainstream relational database engines (Oracle, MS SQL Server, IBM's DB2 and Sybase) now have IR add-ons as well. Feature-wise, the IR offerings of RDBMSs currently lag behind their older stand-alone counterparts, but the gap is narrowing: in some respects (notably support for document collections in multiple languages) Oracle and DB2 are actually more advanced than many commercial "pure IR" systems.
There are several differences between searching a collection of
semi-structured documents and searching a relational database table.
- Because of synonyms, word variants, abbreviations, acronyms and so forth (or even misspellings), searching a collection of documents using keywords is not guaranteed to give you all the hits you want. In IR, the ratio of the number of relevant documents retrieved to the number of relevant documents actually present in the collection is termed recall. When a query method is evaluated with large collections, recall is estimated by a process of sampling (which involves a lot of manual labor).
- Because the same word can have multiple meanings in different contexts
(the word "set" is supposed to have more than twenty), retrieval
using keywords will also return certain non-relevant documents. In IR, the
ratio of the number of relevant documents retrieved to the total
(relevant + non-relevant) number of documents retrieved by a query is termed
precision.
- In general, for a given query method, recall and precision vary inversely. Increasing precision (by reducing the number of "false positives") runs the risk of reducing recall as well, because some relevant documents may be erroneously rejected. Conversely, increasing the number of recalled documents increases the risk of returning many non-relevant documents. Various methods are used to compare different query methods with each other for overall effectiveness. One graphical method, based on the ROC (receiver operating characteristic) curve, plots recall against (1 - precision). (A small worked example of both measures follows this list.)
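To make the two measures concrete, here is a minimal sketch in Python; the sets of relevant and retrieved document IDs are hypothetical:

    # Recall and precision for a single query, following the definitions
    # above. The document IDs are made up; in practice the set of
    # relevant documents is estimated by manual sampling.
    relevant = {1, 2, 3, 5, 8}       # judged relevant by human reviewers
    retrieved = {2, 3, 4, 5, 9, 10}  # returned by the query

    hits = relevant & retrieved      # relevant documents actually retrieved

    recall = len(hits) / len(relevant)       # 3/5 = 0.60
    precision = len(hits) / len(retrieved)   # 3/6 = 0.50
    print(recall, precision)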
Pre-processing a Collection of Documents for Efficient Searching
Even with the power of today's supercomputers, if a huge collection of documents had to be scanned linearly for every occurrence of a word, the search process would take several hours. Search engines need to return results in a minute or so. To achieve this speed, they create an index. The process works as follows:
- Each document is broken up into its individual words, and "stop words" (words with very little value in searching, such as "the", "an", "of", etc.) are removed. Most programs use a list of the 400 commonest words in English for this purpose. However, depending on the collection, one may want to add more stop words to the list. Thus, for a collection of documents about nervous system research, "brain" would be a stop word.
- The remaining words may be further processed by normalization, which means removing variations of tense, number and person. A more drastic transformation is stemming, which involves removal of suffixes. Thus "operating", "operates", "operated", "operation" etc. get reduced to the root form "operat". (Stemming is not foolproof, because it generally uses simple rules of grammar rather than deep vocabulary knowledge. Thus, in medicine, "renal" and "kidney", which mean the same thing, would not be stemmed to a common root. Also, the stemming process is sometimes over-enthusiastic, removing suffixes that clinicians consider important. Thus "hepatitis" and "hepatoma", which refer to two quite different liver diseases, are incorrectly truncated to the same root. Similarly, "modular" and "modulation", which mean two different things, also yield the same root, "modul".)
- Finally, all root terms across all documents are stored in a table and assigned numbers. This information is used to create two indexes. The global document-frequency index records, for each term, how many documents in the collection contain it. The term-frequency index records how often a term occurs in each document. Rarely, a third proximity index is created, which records, for each term, its position in the document, recorded as word number, paragraph number, sentence number and so on. (Proximity indexes take up a lot of space. However, they are useful for searches where you specify two words and require that they be within the same sentence, or next to each other, or within so many words of each other.)
- All these types of indexes can be created within the framework of a relational database: each of them can be implemented as a table. However, for "pure" IR purposes, a pointer-based structure called an inverted file is more efficient in terms of space and speed. (De Vries has argued convincingly, however, that for RDBMSs, where one typically wants to perform a "mixed" query that reaches both the tabular and textual parts of the data, using tables to store the indexes may be the better way to go, in terms of letting the database engine do much more to optimize data retrieval.) A sketch of the whole pipeline follows this list.
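To make the pre-processing pipeline concrete, here is a minimal sketch in Python. The stop list and the suffix rules are toy stand-ins for the real (much larger) versions, and the two "documents" are invented:

    import re
    from collections import Counter, defaultdict

    STOP_WORDS = {"the", "an", "a", "of", "and", "is", "at", "on", "was"}
    SUFFIXES = ("ion", "ing", "es", "ed", "s")   # checked in this order

    def stem(word):
        # Crude suffix-stripping; real stemmers (e.g., the Porter
        # algorithm) use more careful rules, but they share the failure
        # modes noted above ("hepatitis"/"hepatoma" and friends).
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[:-len(suffix)]
        return word

    def tokenize(text):
        words = re.findall(r"[a-z]+", text.lower())
        return [stem(w) for w in words if w not in STOP_WORDS]

    documents = {
        1: "The patient was operated on; the operation succeeded.",
        2: "Operating notes: the surgeon operates at dawn.",
    }

    term_freq = {}               # term-frequency index: doc -> term -> count
    doc_freq = defaultdict(int)  # document-frequency index: term -> no. of docs
    for doc_id, text in documents.items():
        counts = Counter(tokenize(text))
        term_freq[doc_id] = counts
        for term in counts:
            doc_freq[term] += 1

    print(doc_freq["operat"])    # 2: both documents contain a form of "operate"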
Types of Query Methods
There are two methods of searching such indexes. The conceptually simpler (and older) Boolean method allows the user to specify keywords combined with the operators AND, OR and NOT, in a manner familiar to users of Medline (a sketch of Boolean retrieval in terms of set operations follows the list below). Boolean search methods have been criticized on the following grounds.
- It was believed that many users found Boolean logic hard to understand,
and were likely to specify queries incorrectly, returning either too many or
too few answers. This assumption is not quite true, as anyone who has spent
more than fifteen minutes learning Boolean logic can attest.
- The returned documents are not ranked in any way. If numerous documents
are retrieved, the user must scan each one to determine how relevant it is
to the query. This criticism is genuine.
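With an inverted file in hand, Boolean retrieval reduces to set operations on each term's postings, i.e., the set of documents containing it. A minimal sketch, with hypothetical postings:

    # AND, OR and NOT become set intersection, union and difference.
    postings = {
        "hepatitis": {1, 4, 7},
        "cirrhosis": {2, 4, 9},
        "alcohol":   {3, 4, 7, 9},
    }

    print(postings["hepatitis"] & postings["alcohol"])    # AND -> {4, 7}
    print(postings["hepatitis"] | postings["cirrhosis"])  # OR  -> {1, 2, 4, 7, 9}
    print(postings["hepatitis"] - postings["alcohol"])    # AND NOT -> {1}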
Document Vector Approach
To address the issue of relevance ranking, the Document Vector approach was devised by Gerard Salton (who is to the field of IR what Einstein and Newton were to physics in their days). In this approach, each document is treated as a vector in N-dimensional space, where N is the number of terms in the document. (So, if we have a thousand terms, we are talking about a thousand dimensions. However, the actual math turns out to be very simple.) Here, the user simply types in a list of keywords of interest, and the search engine retrieves matches, ranking them in descending order of relevance. We now describe how this works.
- We derive the vector for each document as follows. (This can be done in advance, and the results saved so that the document vector can be used in multiple queries.) Each term in the document is given a weight. This weight is typically based on what is termed "IDF * TF", where IDF stands for inverse document frequency and TF stands for term frequency ("*" indicates multiplication).
- IDF is defined as log (number of documents in the collection / number of documents in the collection containing this word). This reflects the fact that uncommon words, when specified in a query, are more likely to be useful in narrowing down the selection of documents than very common words.
- TF is defined as log (frequency of the term in this document). This reflects the fact that if a keyword occurs multiple times in a document, that document is more likely to be relevant than a document where the keyword occurs just once.
- The query itself can be regarded as an M-dimensional vector, where M is
the number of terms in the query. Each dimension is given a weight of 1.
- To compare the relevance of a particular document to the query, we compare the query vector with the document's vector using what is called the normalized cosine coefficient, the cosine of the angle between the two vectors in space. (This seems mathematically complex, but it really amounts to simply choosing only those keywords in the document that are part of the query. If a document does not contain any of the keywords of interest, it can simply be rejected.) If the document is "similar" to the query, the angle between the two vectors will be close to zero (and cos(0) is equal to one). If the document and query do not have any terms in common, the angle is 90 degrees (and cos(90) = 0). Relevance ranking means sorting the matching documents in descending order of cosine.
- The word "normalized" means that the product of IDF and TF is adjusted for the length of the document vector, which is the square root of the sum of the squares of the individual IDF*TF weights for each term. The idea behind normalization is that the weight of a term in a document must be adjusted for the document's length. Thus, if a keyword occurs three times in a 300-word document, that document should be given more weight than a 10,000-word document where the keyword occurs five times. (A sketch of the whole weighting-and-ranking computation follows this list.)
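Here is a minimal sketch of the whole computation in Python, following the definitions above, with one labeled departure: TF = log(frequency) gives a weight of zero to a term occurring only once, so the sketch uses the common 1 + log(frequency) variant. The tiny pre-tokenized documents are hypothetical:

    import math

    documents = {
        1: ["renal", "failure", "renal", "dialysis"],
        2: ["renal", "transplant"],
        3: ["cardiac", "failure"],
    }
    N = len(documents)

    doc_freq = {}                # in how many documents does each term occur?
    for terms in documents.values():
        for term in set(terms):
            doc_freq[term] = doc_freq.get(term, 0) + 1

    def weights(terms):
        # IDF*TF weight for each distinct term of one document
        w = {}
        for term in set(terms):
            tf = 1 + math.log(terms.count(term))   # 1 + log: see note above
            idf = math.log(N / doc_freq[term])
            w[term] = tf * idf
        return w

    def rank(query_terms):
        # Query weights are all 1, so the dot product reduces to the sum
        # of the document's weights for the shared terms, divided by the
        # document vector's length (the "normalized cosine coefficient").
        scores = []
        for doc_id, terms in documents.items():
            w = weights(terms)
            length = math.sqrt(sum(x * x for x in w.values()))
            if length == 0:
                continue                           # no discriminating terms
            score = sum(w[t] for t in query_terms if t in w) / length
            if score > 0:
                scores.append((round(score, 3), doc_id))
        return sorted(scores, reverse=True)

    print(rank({"renal", "failure"}))   # document 1 ranks first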
Modern search engines use a combination of Boolean and vector methods. Pure vector-based searching alone returns far too many hits, as anyone who has ever used AltaVista has painfully discovered. Boolean methods are therefore necessary to restrict the search process. Most Web search engines accept the character + before a word (with no intervening space) to mandate that all retrieved documents must contain that word. Similarly, the character - before a word indicates that the retrieved documents should not contain that word. (The best-known example of this, given in a New York Times article, is when you want to research Jordan, the country, and you don't want a zillion basketball references coming up. To do this, you type +Jordan -Michael as your search query.) The query engine uses Boolean search to restrict the number of hits, and vector processing for relevance ranking to order them, as the sketch below shows.
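A minimal sketch of that two-step process, combining the +/- syntax with the pieces sketched earlier (hypothetical postings as in the Boolean example, followed by a vector-style rank() such as the one above):

    def parse_query(query):
        required, forbidden, optional = set(), set(), set()
        for token in query.split():
            if token.startswith("+"):
                required.add(token[1:].lower())
            elif token.startswith("-"):
                forbidden.add(token[1:].lower())
            else:
                optional.add(token.lower())
        return required, forbidden, optional

    def boolean_filter(all_docs, postings, required, forbidden):
        docs = set(all_docs)
        for term in required:
            docs &= postings.get(term, set())  # must contain every +term
        for term in forbidden:
            docs -= postings.get(term, set())  # must not contain any -term
        return docs

    postings = {"jordan": {1, 2, 3, 4}, "michael": {2, 4}, "travel": {1, 5}}
    required, forbidden, optional = parse_query("+Jordan -Michael travel")
    candidates = boolean_filter({1, 2, 3, 4, 5}, postings, required, forbidden)
    print(candidates)   # {1, 3}: contain "jordan" but not "michael"; these
                        # survivors would then be cosine-ranked on "travel"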
The IDF * TF approach is rather simplistic, and many commercial engines use refinements. For example, some also use a proprietary weighting scheme based on how close together the individual terms of the query are within each document (the closer they are, the higher the weight). Also, the IDF*TF approach is readily fooled by Web pages that contain the same phrase (which may be an obscene one) literally hundreds of times, in meta-tags that are visible to the search engine but not to the end-user. The intention of repeating the phrase is to make such pages rank highly when a user searches for documents containing it. Engines such as Google are much harder to fool: they rank each document not only by an IDF * TF approach, but also by how many other pages refer to it. (The other pages, so to speak, "vote" for the pages that they have found most useful.) The voting pages are in turn voted for by other pages regarding their own utility. (In general, if you want to make your Web pages rank high on Google, you're better off focusing on adding good content that other people then bookmark or hyperlink to.)
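That recursive voting scheme is the idea behind Google's PageRank algorithm. Here is a minimal power-iteration sketch over a hypothetical four-page link graph (the damping factor of 0.85 is the conventional choice):

    links = {                # page -> pages it links ("votes") for
        "a": ["b", "c"],
        "b": ["c"],
        "c": ["a"],
        "d": ["c"],
    }
    pages = list(links)
    rank = {p: 1 / len(pages) for p in pages}
    damping = 0.85

    for _ in range(50):      # iterate until the scores settle
        new_rank = {p: (1 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            share = damping * rank[page] / len(outgoing)
            for target in outgoing:
                new_rank[target] += share   # each page shares out its score
        rank = new_rank

    print(max(rank, key=rank.get))   # "c": the most voted-for page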
The difference between the "vanilla" search engines and the fancier ones is that the latter offer you many more choices over the indexing and search process. They will understand documents in their native formats (e.g., HTML, MS Word, WordPerfect), let you add words to the stop-word list, and offer proximity searching as an option.
Historical Sidelines on Vector-based Methods
Just as we can compare a query vector with a document vector, we can also
compare document vectors with each other. The document vector to some extent
acts like a fingerprint for the author, because particular authors tend to use
some words more often than others. This knowledge has been used in various ways.
- Researchers have compared the document vectors of Francis Bacon's known
works with each other and to the plays of Shakespeare. Their analyses
indicated that, statistically, it was unlikely that Bacon was the author of
Shakespeare's plays (as many have believed in the past).
- During the Cold War, "Kremlin Watchers" were able to classify a particular Kremlin communique as having come from one of a small handful of anonymous writers.
- The PubMed database, maintained by the National Center for Biotechnology Information, was used to identify a case of scientific plagiarism. As devised by NCBI's John Wilbur, document-similarity information is precomputed and stored: given a document, you can find similar documents in the database. A researcher found that older documents flagged as highly similar to papers published by a particular Polish scientist had in fact been plagiarized by that scientist. (PubMed was able to identify the similarities despite the fact that, in the latter case, it was working with translations.)