@(#)refer 8.1 (Berkeley) 06/08/93
.EH 'USD:27-%''Some Applications of Inverted Indexes on the UNIX System'
.OH 'Some Applications of Inverted Indexes on the UNIX System''USD:27-%'
.nr LL 6.5i
.nr LT 6.5i
.ND June 21, 1978
.TL
Some Applications of Inverted Indexes on the UNIX System
.AU "MH 2C-572" 6377
M. E. Lesk
.AI
.MH
.AB
.LP
.ft B
I. Some Applications of Inverted Indexes - Overview
.ft R
.PP
This memorandum describes a set of programs which
make inverted indexes to
UNIX*
text files, and their
application to
retrieving and formatting citations for documents prepared using
.I troff.
.PP
The indexing and searching programs make keyword
indexes to volumes of material too large for linear searching.
Searches for combinations of single words can be performed quickly.
The programs for general searching are divided into
two phases. The first makes an index from the original
data; the second searches the index and retrieves
items.
Both of these phases are further divided into two parts
to separate the data-dependent and algorithm-dependent
code.
.PP
The major current application of these programs is
the
.I troff
preprocessor
.I refer.
A list of 4300 references is maintained on line,
containing primarily papers written and cited by
local authors.
Whenever one of these references is required
in a paper, a few words from the title or author list
will retrieve it, and the user need not bother to re-enter
the exact citation.
Alternatively, authors can use their own lists of papers.
.PP
This memorandum is of interest to
those who want facilities for searching large
but relatively unchanging text files on
the
UNIX
system,
and those who are interested in handling bibliographic
citations with
UNIX
.I troff.
.LP
.ft B
II. Updating Publication Lists
.PP
This section is a brief note describing the
auxiliary programs for managing the updating
process.
It is written to aid clerical users in
maintaining lists of references.
Primarily, the programs described permit a large
amount of individual control over the content
of publication lists while retaining the
usefulness of the files to other users.
.LP
.ft B
III. Manual Pages
.PP
This section contains the pages from the
UNIX programmer's manual
dealing with these commands.
It is useful for reference.
.sp
\l'3i'
.br
* UNIX is a trademark of Bell Laboratories.
.AE
.CS 10 4 14 0 0 4
.LP
.ft B
Introduction.
.ft R
.PP
The UNIX system has many utilities (e.g.
.I grep ,
.I awk ,
.I lex ,
.I egrep ,
.I fgrep ,
\&...) to search through files of text,
but most of them are based on a linear scan through the entire file,
using some deterministic automaton.
This memorandum discusses a program which uses inverted indexes
.[
%A D. Knuth
%T The Art of Computer Programming: Vol. 3, Sorting and Searching
%I Addison-Wesley
%C Reading, Mass.
%D 1977
%O See section 6.5.
.]
and can thus be used on much larger data bases.
.PP
As with any indexing system, of course, there are some disadvantages;
once an index is made, the files that have been indexed can not be
changed without remaking the index.
Thus applications are restricted to those making many searches of
relatively stable data.
Furthermore, these programs depend on hashing,
and can only search for exact matches of whole keywords.
It is not possible to look for arithmetic or logical expressions
(e.g. ``date greater than 1970'')
or for regular expression searching such as that in
.I lex .
.[
lex lesk cstr
.]
.PP
Currently there are two uses of this software:
the
.I refer
preprocessor to format references,
and the
.I lookall
command to search through all text files on the UNIX system.\(dd
.FS
\(dd
.I lookall
is not part of the Berkeley UNIX distribution.
.FE
.PP
The remaining sections of this memorandum discuss the searching
programs and their uses.
Section 2 explains the operation of the searching algorithm
and describes the data collected for use with the
.I lookall
command.
The more important application,
.I refer ,
has a user's description in section 3.
Section 4 goes into more detail on reference files,
for the benefit of those who wish to add references to data bases
or write new
.I troff
macros for use with
.I refer .
The options to make
.I refer
collect identical citations,
or otherwise relocate and adjust references,
are described in section 5.
.LP
.ft B
Searching.
.ft R
.PP
The indexing and searching process is divided into two phases,
each made of two parts.
These are shown below.
.DS
Phase A: make the index.
	(1) Find keys \(em turn the input files into a sequence of tags and keys.
	(2) Hash and sort \(em prepare the inverted index.
Phase B: retrieve the items.
	(3) Search \(em look up the query keys in the index.
	(4) Deliver \(em fetch the items identified by the tags.
.DE
.PP
The first phase, making the index, is presumably done relatively
infrequently.
It should, of course, be done whenever the data being indexed change.
In contrast, the second phase, retrieving items, is presumably done
often, and must be rapid.
.PP
An effort is made to separate code which depends on the data being
handled from code which depends on the searching procedure.
The search algorithm is involved only in programs (2) and (3),
while knowledge of the actual data files is needed only by
programs (1) and (4).
Thus it is easy to adapt to different data files or different
search algorithms.
.PP
To start with, it is necessary to have some way of selecting or
generating keys from input files.
For dealing with files that are basically English,
we have a key-making program which automatically selects words
and passes them to the hashing and sorting program (step 2).
The format used has one line for each input item, arranged as follows:
.DS
\f2name\fP:\f2start\fP,\f2length\fP (tab) \f2key1\fP \f2key2\fP \f2key3\fP ...
.DE
where
.I name
is the file name,
.I start
is the starting byte number,
and
.I length
is the number of bytes in the entry.
.PP
These lines are the only input used to make the index.
The first field (the file name, byte position, and byte count)
is the tag of the item and can be used to retrieve it quickly.
Normally, an item is either a whole file or a section of a file
delimited by blank lines.
After the tab, the second field contains the keys.
The keys, if selected by the automatic program,
are any alphanumeric strings which are not among the 100 most
frequent words in English and which are not entirely numeric
(except for four-digit numbers beginning 19, which are accepted as dates).
Keys are truncated to six characters and converted to lower case.
Some selection is needed if the original items are very large.
We normally just take the first
.I n
keys, with
.I n
less than 100 or so;
this replaces any attempt at intelligent selection.
One file in our system is a complete English dictionary;
it would presumably be retrieved for all queries.
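As an illustration, the key-selection rules above can be sketched in modern Python (an anachronism used only for exposition). The common-word list is a tiny stand-in for /usr/lib/eign, and the function names are invented for this sketch, not part of the actual mkey program.

```python
# Sketch of the key-selection rules described above (not the real mkey):
# keep alphanumeric strings that are not common words and not entirely
# numeric (except four-digit numbers beginning 19, accepted as dates);
# lowercase each key and truncate it to six characters.
import re

COMMON = {"the", "of", "and", "a", "in", "to", "is"}  # stand-in word list

def make_keys(text, max_keys=100):
    keys = []
    for word in re.findall(r"[A-Za-z0-9]+", text):
        w = word.lower()
        if w in COMMON:
            continue
        if w.isdigit() and not (len(w) == 4 and w.startswith("19")):
            continue
        keys.append(w[:6])
        if len(keys) == max_keys:  # crude selection for very large items
            break
    return keys

def key_line(name, start, length, text):
    # One output line per item: name:start,length TAB key1 key2 ...
    return "%s:%d,%d\t%s" % (name, start, length, " ".join(make_keys(text)))

print(key_line("papers", 0, 42, "Inverted Indexes and the UNIX System, 1977"))
```

The first field of each emitted line is the tag; everything after the tab is fed to the hashing and sorting step.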
.PP
To generate an inverted index to the list of record tags and keys,
the keys are hashed and sorted to produce an index.
What is wanted, ideally, is a series of lists showing the tags
associated with each key.
To condense this, what is actually produced is a list showing the
tags associated with each hash code, and thus with some set of keys.
To speed up access and further save space, a set of three or
possibly four files is produced.
These files are:
.KS
.TS
center box;
l l.
File	Contents
_
entry	Pointers to posting file for each hash code
posting	Lists of tag pointers for each hash code
tag	Tags for each item
key	Keys for each item (optional)
.TE
.KE
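The relationship among the entry, posting, and tag files can be sketched with a toy in-memory version. The hash function, the seven-entry table (the text uses 997 or 5,000 codes), and all names are invented for illustration; the real inv program writes these structures to disk.

```python
# Toy sketch of the index structure: entry[h] holds (offset, count)
# pointers into the posting list, and each posting is an index into
# the tag list -- mirroring the entry, posting, and tag files.
NHASH = 7  # toy table size; the real tables are far larger

def hcode(key):
    return sum(key.encode()) % NHASH  # stand-in hash function

def invert(key_lines):
    tags = []
    postings = [[] for _ in range(NHASH)]
    for line in key_lines:
        tag, keys = line.split("\t")
        item = len(tags)
        tags.append(tag)
        # post the item once under each distinct hash code of its keys
        for h in {hcode(k) for k in keys.split()}:
            postings[h].append(item)
    # flatten the per-code lists; entry[h] points into the posting list
    posting_file, entry = [], []
    for lst in postings:
        entry.append((len(posting_file), len(lst)))
        posting_file.extend(sorted(lst))
    return entry, posting_file, tags

entry, post, tags = invert(["a:0,10\tinvert index", "b:0,20\tindex unix"])
```

Looking up a key then costs one hash, one entry-file read, and one run of the posting file, with no scan of the data itself.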
.PP
The searching process starts with a query, containing several keys.
The goal is to obtain all items which were indexed under these keys.
The query keys are hashed, and the pointers in the entry file are
used to access the lists in the posting file.
These lists are addresses in the tag file of documents posted under
the hash codes derived from the query.
The common items from all lists are determined;
this must include the items indexed by every key,
but may also contain some items which are false drops,
since items referenced by the correct hash codes need not actually
have contained the correct keys.
Normally, if there are several keys in the query,
there are not likely to be many false drops in the final combined
list even though each hash code is somewhat ambiguous.
The actual tags are then obtained from the tag file,
and to guard against the possibility that an item has false-dropped
on some hash code in the query,
the original items are normally obtained from the delivery program (4)
and the query keys checked against them by string comparison.
.PP
Usually, therefore, the check for bad drops is made against the
original file.
However, if the key derivation procedure is complex, it may be
preferable to check against the keys fed to program (2).
In this case the optional key file which contains the keys associated
with each item is generated, and the item tag is supplemented by a
string
.I ;start,length
which indicates the starting byte number in the key file
and the length of the string of keys for each item.
This file is not usually necessary with the present key-selection
program, since the keys always appear in the original document.
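The hash, intersect, and verify sequence can be sketched as follows. The hash function, the toy item texts, and all names are invented for illustration; they do not reflect the actual hunt implementation, only the logic of false-drop elimination by string comparison.

```python
# Sketch of the search phase: hash the query keys, intersect the
# posting lists, then re-check candidates by string comparison to
# discard false drops (items sharing hash codes but not keys).
def hcode(key, nhash=7):
    return sum(key.encode()) % nhash  # stand-in hash function

# toy index: posting sets per hash code, plus the original item texts
items = ["invert index paper", "unix index paper", "grep notes"]
postings = {}
for i, text in enumerate(items):
    for h in {hcode(w) for w in text.split()}:
        postings.setdefault(h, set()).add(i)

def search(query):
    cands = set.intersection(*(postings.get(hcode(k), set()) for k in query))
    # false-drop check: every query key must literally occur in the item
    return sorted(i for i in cands if all(k in items[i].split() for k in query))

print(search(["index", "paper"]))  # prints [0, 1]
```

With this toy hash, "unix" and "index" collide, so searching for "unix paper" first drops item 0 into the candidate set and then discards it at the string-comparison step, exactly the false-drop behavior described above.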
.PP
There is also an option (\f3\-C\f2n\|\f1) for coordination level
searching.
This retrieves items which match all but
.I n
of the query keys.
The items are retrieved in the order of the number of keys that
they match.
Of course,
.I n
must be less than the number of query keys
(nothing is retrieved unless it matches at least one key).
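Coordination-level retrieval can be sketched as below. This matches on each item's own key set for brevity, whereas the real programs work through the hash codes; the function and item names are invented for the sketch.

```python
# Sketch of coordination-level searching (-Cn): keep items matching
# at least len(query) - n keys, best matches first.
def coordinate_search(items, query, n):
    scored = []
    for tag, keys in items.items():
        hits = sum(1 for k in query if k in keys)
        if hits >= max(len(query) - n, 1):  # must match at least one key
            scored.append((hits, tag))
    return [tag for hits, tag in sorted(scored, key=lambda s: -s[0])]

items = {
    "doc1": {"invert", "index", "unix"},
    "doc2": {"index", "unix"},
    "doc3": {"grep", "search"},
}
print(coordinate_search(items, ["invert", "index", "unix"], 1))
```

With n = 0 this reduces to the ordinary all-keys search; raising n trades precision for recall while the ordering keeps the best matches first.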
.PP
As an example, consider one set of 4377 references, comprising
660,000 bytes.
This included 51,000 keys, of which 5,900 were distinct keys.
The hash table is kept full to save space (at the expense of time);
995 of 997 possible hash codes were used.
The total set of index files (no key file) included 171,000 bytes,
about 26% of the original file size.
It took 8 minutes of processor time to hash, sort, and write the index.
To search for a single query with the resulting index took 1.9 seconds
of processor time, while to find the same paper with a sequential
linear search using
.I grep
(reading all of the tags and keys)
took 12.3 seconds of processor time.
.PP
We have also used this software to index all of the English stored
on our UNIX system.
This is the index searched by the
.I lookall
command.
On a typical day there were 29,000 files in our user file system,
containing about 152,000,000 bytes.
Of these, 5,300 files, containing 32,000,000 bytes (about 21%),
were English text.
The total number of `words' (determined mechanically) was 5,100,000.
Of these, 227,000 were selected as keys;
19,000 were distinct, hashing to 4,900 (of 5,000 possible) different
hash codes.
The resulting inverted file indexes used 845,000 bytes,
or about 2.6% of the size of the original files.
The particularly small indexes are caused by the fact that keys are
taken from only the first 50 non-common words of some very long
input files.
.PP
Even this large
.I lookall
index can be searched quickly.
For example, to find this document by looking for the keys
``lesk inverted indexes'' required 1.7 seconds of processor and
system time.
By comparison, just to search the 800,000 byte dictionary
(smaller than even the inverted indexes, let alone the 27,000,000
bytes of text files) with
.I grep
takes 29 seconds of processor time.
The
.I lookall
program is thus useful when looking for a document which you believe
is stored on-line, but do not know where.
For example, many memos from our center are in the file system,
but it is often difficult to guess where a particular memo might be
(it might have several authors, each with many directories,
and have been worked on by a secretary with yet more directories).
Instructions for the use of the
.I lookall
command are given in the manual section,
shown in the appendix to this memorandum.
.PP
The only indexes maintained routinely are those of publication lists
and all English files.
To make other indexes, the programs for making keys, sorting them,
searching the indexes, and delivering answers must be used.
Since they are usually invoked as parts of higher-level commands,
they are not in the default command directory,
but are available to any user in the directory
.I /usr/lib/refer .
Three programs are of interest:
.I mkey ,
which isolates keys from input files;
.I inv ,
which makes an index from a set of keys;
and
.I hunt ,
which searches the index and delivers the items.
Note that the two parts of the retrieval phase are combined into one
program, to avoid the excessive system work and delay which would
result from running these as separate processes.
.PP
These three commands have a large number of options to adapt to
different kinds of input.
The user not interested in the detailed description that now follows
may skip to section 3, which describes the
.I refer
program, a packaged-up version of these tools specifically oriented
towards formatting references.
.LP
.ft B
Make Keys.
.ft R
.PP
The program
.I mkey
is the key-making program corresponding to step (1) in phase A.
Normally, it reads its input from the file names given as arguments,
and if there are no arguments it reads from the standard input.
It assumes that blank lines in the input delimit separate items,
for each of which a different line of keys should be generated.
The lines of keys are written on the standard output.
Keys are any alphanumeric string in the input not among the most
frequent words in English and not entirely numeric
(except that all-numeric strings are acceptable if they are between
1900 and 1999).
In the output, keys are translated to lower case, and truncated to
six characters in length;
any associated punctuation is removed.
The following flag arguments are recognized by
.I mkey :
.TS
lw(1i) lw(4.5i).
\f3\-c\f1 \f2name\f1	T{
Name of file of common words; default is \f2/usr/lib/eign\f1.
T}
\f3\-f\f1 \f2name\f1	T{
Read a list of files from \f2name\f1 and take each as an input argument.
T}
\f3\-i\f1 \f2chars\f1	T{
Ignore all lines which begin with `%' followed by any character in \f2chars\f1.
T}
\f3\-k\f2n\f1	T{
Use at most \f2n\f1 keys per input item.
T}
\f3\-l\f2n\f1	T{
Ignore items shorter than \f2n\f1 letters long.
T}
\f3\-n\f2m\f1	T{
Ignore as a key any word in the first \f2m\f1 words of the list of
common English words.
The default is 100.
T}
\f3\-s\f1	T{
Remove the labels (\f2file\f1:\f2start\f1,\f2length\f1) from the output;
just give the keys.
Used when searching rather than indexing.
T}
\f3\-w\f1	T{
Each whole file is a separate item;
blank lines in files are irrelevant.
T}
.TE
.PP
The normal arguments for indexing references are the defaults,
which are \f3\-c\f1 \f2/usr/lib/eign\f1, \f3\-n100\f1, and \f3\-l3\f1.
For searching, the \f3\-s\f1 option is also needed.
When the big
.I lookall
index of all English files is run,
the options are \f3\-w\f1, \f3\-k50\f1, and \f3\-f\f1 \f2(filelist)\f1.
When running on textual input, the
.I mkey
program processes about 1000 English words per processor second.
Unless the \f3\-k\f1 option is used
(and the input files are long enough for it to take effect),
the output of
.I mkey
is comparable in size to its input.
.LP
.ft B
Hash and Invert.
.ft R
.PP
The
.I inv
program computes the hash codes and writes the inverted files.
It reads the output of
.I mkey
and writes the set of files described earlier in this section.
It expects one argument, which is used as the base name for the
three (or four) files to be written.
Assuming an argument of
.I Index
(the default),
the entry file is named
.I Index.ia ,
the posting file
.I Index.ib ,
the tag file
.I Index.ic ,
and the key file (if present)
.I Index.id .
The
.I inv
program recognizes the following options: