1
2This is a version of Henry Spencer's famous regexp implementation. I've
3modified it to meet my needs, this is what I've done:
4
5 2) added a new function regsublen(), this performs a dry run of the
6 regsub() function returning the length of the string needed to hold
7 the output from regsub().
8 3) changed regexec(prog,str) to regexec2(prog,str,eflags) with macro for
9 regexec(). This is so I can have the flag REG_NOTBOL which signifies
10 that the string passed to regexec[2]() is not actually the start of a
11 line.
12 4) support for case-insignificant matching (with the flag REG_NOCASE)
13 5) split the definition of a compiled regexp from regexp.c into
14 a new file regprog.h
15 6) created a new file regjade.c which uses the regexec() structure to
16 match regexp against editor buffers in place.
17 7) Altered the regexp structure to allow storing of subexpressions as
18 positions in a Jade buffer. Also altered calling conventions of
19 regsub() and regsublen() to support this.
20 8) support \w, \W, \s, \S, \d, \D, \b, \B, *?, +?, ?? syntax (as in Perl)
21
22And probably some other things as well. Obviously all errors are my
23responsibility. The original README follows,
24
25John
26
27--
28
29This is a nearly-public-domain reimplementation of the V8 regexp(3) package.
30It gives C programs the ability to use egrep-style regular expressions, and
31does it in a much cleaner fashion than the analogous routines in SysV.
32
33 Copyright (c) 1986 by University of Toronto.
34 Written by Henry Spencer. Not derived from licensed software.
35
36 Permission is granted to anyone to use this software for any
37 purpose on any computer system, and to redistribute it freely,
38 subject to the following restrictions:
39
40 1. The author is not responsible for the consequences of use of
41 this software, no matter how awful, even if they arise
42 from defects in it.
43
44 2. The origin of this software must not be misrepresented, either
45 by explicit claim or by omission.
46
47 3. Altered versions must be plainly marked as such, and must not
48 be misrepresented as being the original software.
49
50Barring a couple of small items in the BUGS list, this implementation is
51believed 100% compatible with V8. It should even be binary-compatible,
52sort of, since the only fields in a "struct regexp" that other people have
53any business touching are declared in exactly the same way at the same
54location in the struct (the beginning).
55
56This implementation is *NOT* AT&T/Bell code, and is not derived from licensed
57software. Even though U of T is a V8 licensee. This software is based on
58a V8 manual page sent to me by Dennis Ritchie (the manual page enclosed
59here is a complete rewrite and hence is not covered by AT&T copyright).
60The software was nearly complete at the time of arrival of our V8 tape.
61I haven't even looked at V8 yet, although a friend elsewhere at U of T has
62been kind enough to run a few test programs using the V8 regexp(3) to resolve
63a few fine points. I admit to some familiarity with regular-expression
64implementations of the past, but the only one that this code traces any
65ancestry to is the one published in Kernighan & Plauger (from which this
66one draws ideas but not code).
67
68Simplistically: put this stuff into a source directory, copy regexp.h into
69/usr/include, inspect Makefile for compilation options that need changing
70to suit your local environment, and then do "make r". This compiles the
71regexp(3) functions, compiles a test program, and runs a large set of
72regression tests. If there are no complaints, then put regexp.o, regsub.o,
73and regerror.o into your C library, and regexp.3 into your manual-pages
74directory.
75
76Note that if you don't put regexp.h into /usr/include *before* compiling,
77you'll have to add "-I." to CFLAGS before compiling.
78
79The files are:
80
81Makefile instructions to make everything
82regexp.3 manual page
83regexp.h header file, for /usr/include
84regexp.c source for regcomp() and regexec()
85regsub.c source for regsub()
86regerror.c source for default regerror()
87regmagic.h internal header file
88try.c source for test program
89timer.c source for timing program
90tests test list for try and timer
91
92This implementation uses nondeterministic automata rather than the
93deterministic ones found in some other implementations, which makes it
94simpler, smaller, and faster at compiling regular expressions, but slower
95at executing them. In theory, anyway. This implementation does employ
96some special-case optimizations to make the simpler cases (which do make
97up the bulk of regular expressions actually used) run quickly. In general,
98if you want blazing speed you're in the wrong place. Replacing the insides
99of egrep with this stuff is probably a mistake; if you want your own egrep
100you're going to have to do a lot more work. But if you want to use regular
101expressions a little bit in something else, you're in luck. Note that many
102existing text editors use nondeterministic regular-expression implementations,
103so you're in good company.
104
105This stuff should be pretty portable, given appropriate option settings.
106If your chars have less than 8 bits, you're going to have to change the
107internal representation of the automaton, although knowledge of the details
108of this is fairly localized. There are no "reserved" char values except for
109NUL, and no special significance is attached to the top bit of chars.
110The string(3) functions are used a fair bit, on the grounds that they are
111probably faster than coding the operations in line. Some attempts at code
112tuning have been made, but this is invariably a bit machine-specific.
113
1
2
3
4
5
6
7 sdbm - Substitute DBM
8 or
9 Berkeley ndbm for Every UN*X[1] Made Simple
10
11 Ozan (oz) Yigit
12
13 The Guild of PD Software Toolmakers
14 Toronto - Canada
15
16 oz@nexus.yorku.ca
17
18
19
20Implementation is the sincerest form of flattery. - L. Peter
21Deutsch
22
23A The Clone of the ndbm library
24
25 The sources accompanying this notice - sdbm - consti-
26tute the first public release (Dec. 1990) of a complete
27clone of the Berkeley UN*X ndbm library. The sdbm library is
28meant to clone the proven functionality of ndbm as closely
29as possible, including a few improvements. It is practical,
30easy to understand, and compatible. The sdbm library is not
31derived from any licensed, proprietary or copyrighted
32software.
33
34 The sdbm implementation is based on a 1978 algorithm
35[Lar78] by P.-A. (Paul) Larson known as ``Dynamic Hashing''.
36In the course of searching for a substitute for ndbm, I pro-
37totyped three different external-hashing algorithms [Lar78,
38Fag79, Lit80] and ultimately chose Larson's algorithm as a
39basis of the sdbm implementation. The Bell Labs dbm (and
40therefore ndbm) is based on an algorithm invented by Ken
41Thompson, [Tho90, Tor87] and predates Larson's work.
42
43 The sdbm programming interface is totally compatible
44with ndbm and includes a slight improvement in database ini-
45tialization. It is also expected to be binary-compatible
46under most UN*X versions that support the ndbm library.
47
48 The sdbm implementation shares the shortcomings of the
49ndbm library, as a side effect of various simplifications to
50the original Larson algorithm. It does produce holes in the
51page file as it writes pages past the end of file. (Larson's
52paper include a clever solution to this problem that is a
53result of using the hash value directly as a block address.)
54On the other hand, extensive tests seem to indicate that
55sdbm creates fewer holes in general, and the resulting page-
56files are smaller. The sdbm implementation is also faster
57than ndbm in database creation. Unlike the ndbm, the sdbm
58_________________________
59
60 [1] UN*X is not a trademark of any (dis)organization.
61
62
63
64
65
66
67
68
69
70 - 2 -
71
72
73store operation will not ``wander away'' trying to split its
74data pages to insert a datum that cannot (due to elaborate
75worst-case situations) be inserted. (It will fail after a
76pre-defined number of attempts.)
77
78Important Compatibility Warning
79
80 The sdbm and ndbm libraries cannot share databases: one
81cannot read the (dir/pag) database created by the other.
82This is due to the differences between the ndbm and sdbm
83algorithms[2], and the hash functions used. It is easy to
84convert between the dbm/ndbm databases and sdbm by ignoring
85the index completely: see dbd, dbu etc.
86
87
88Notice of Intellectual Property
89
90The entire sdbm library package, as authored by me, Ozan S.
91Yigit, is hereby placed in the public domain. As such, the
92author is not responsible for the consequences of use of
93this software, no matter how awful, even if they arise from
94defects in it. There is no expressed or implied warranty for
95the sdbm library.
96
97 Since the sdbm library package is in the public domain,
98this original release or any additional public-domain
99releases of the modified original cannot possibly (by defin-
100ition) be withheld from you. Also by definition, You (singu-
101lar) have all the rights to this code (including the right
102to sell without permission, the right to hoard[3] and the
103right to do other icky things as you see fit) but those
104rights are also granted to everyone else.
105
106 Please note that all previous distributions of this
107software contained a copyright (which is now dropped) to
108protect its origins and its current public domain status
109against any possible claims and/or challenges.
110
111Acknowledgments
112
113 Many people have been very helpful and supportive. A
114partial list would necessarily include Rayan Zacherissen
115(who contributed the man page, and also hacked a MMAP
116_________________________
117
118 [2] Torek's discussion [Tor87] indicates that
119dbm/ndbm implementations use the hash value to traverse
120the radix trie differently than sdbm and as a result,
121the page indexes are generated in different order. For
122more information, send e-mail to the author.
123 [3] You cannot really hoard something that is avail-
124able to the public at large, but try if it makes you
125feel any better.
126
127
128
129
130
131
132
133
134
135
136 - 3 -
137
138
139version of sdbm), Arnold Robbins, Chris Lewis, Bill David-
140sen, Henry Spencer, Geoff Collyer, Rich Salz (who got me
141started in the first place), Johannes Ruschein (who did the
142minix port) and David Tilbrook. I thank you all.
143
144Distribution Manifest and Notes
145
146This distribution of sdbm includes (at least) the following:
147
148 CHANGES change log
149 README this file.
150 biblio a small bibliography on external hashing
151 dba.c a crude (n/s)dbm page file analyzer
152 dbd.c a crude (n/s)dbm page file dumper (for conversion)
153 dbe.1 man page for dbe.c
154 dbe.c Janick's database editor
155 dbm.c a dbm library emulation wrapper for ndbm/sdbm
156 dbm.h header file for the above
157 dbu.c a crude db management utility
158 hash.c hashing function
159 makefile guess.
160 pair.c page-level routines (posted earlier)
161 pair.h header file for the above
162 readme.ms troff source for the README file
163 sdbm.3 man page
164 sdbm.c the real thing
165 sdbm.h header file for the above
166 tune.h place for tuning & portability thingies
167 util.c miscellaneous
168
169 dbu is a simple database manipulation program[4] that
170tries to look like Bell Labs' cbt utility. It is currently
171incomplete in functionality. I use dbu to test out the rou-
172tines: it takes (from stdin) tab separated key/value pairs
173for commands like build or insert or takes keys for commands
174like delete or look.
175
176 dbu <build|creat|look|insert|cat|delete> dbmfile
177
178 dba is a crude analyzer of dbm/sdbm/ndbm page files. It
179scans the entire page file, reporting page level statistics,
180and totals at the end.
181
182 dbd is a crude dump program for dbm/ndbm/sdbm data-
183bases. It ignores the bitmap, and dumps the data pages in
184sequence. It can be used to create input for the dbu util-
185ity. Note that dbd will skip any NULLs in the key and data
186fields, thus is unsuitable to convert some peculiar
187_________________________
188
189 [4] The dbd, dba, dbu utilities are quick hacks and
190are not fit for production use. They were developed
191late one night, just to test out sdbm, and convert some
192databases.
193
194
195
196
197
198
199
200
201
202 - 4 -
203
204
205databases that insist in including the terminating null.
206
207 I have also included a copy of the dbe (ndbm DataBase
208Editor) by Janick Bergeron [janick@bnr.ca] for your pleas-
209ure. You may find it more useful than the little dbu util-
210ity.
211
212 dbm.[ch] is a dbm library emulation on top of ndbm (and
213hence suitable for sdbm). Written by Robert Elz.
214
215 The sdbm library has been around in beta test for quite
216a long time, and from whatever little feedback I received
217(maybe no news is good news), I believe it has been func-
218tioning without any significant problems. I would, of
219course, appreciate all fixes and/or improvements. Portabil-
220ity enhancements would especially be useful.
221
222Implementation Issues
223
224 Hash functions: The algorithm behind sdbm implementa-
225tion needs a good bit-scrambling hash function to be effec-
226tive. I ran into a set of constants for a simple hash func-
227tion that seem to help sdbm perform better than ndbm for
228various inputs:
229
230 /*
231 * polynomial conversion ignoring overflows
232 * 65599 nice. 65587 even better.
233 */
234 long
235 dbm_hash(char *str, int len) {
236 register unsigned long n = 0;
237
238 while (len--)
239 n = n * 65599 + *str++;
240 return n;
241 }
242
243 There may be better hash functions for the purposes of
244dynamic hashing. Try your favorite, and check the pagefile.
245If it contains too many pages with too many holes, (in rela-
246tion to this one for example) or if sdbm simply stops work-
247ing (fails after SPLTMAX attempts to split) when you feed
248your NEWS history file to it, you probably do not have a
249good hashing function. If you do better (for different
250types of input), I would like to know about the function you
251use.
252
253 Block sizes: It seems (from various tests on a few
254machines) that a page file block size PBLKSIZ of 1024 is by
255far the best for performance, but this also happens to limit
256the size of a key/value pair. Depending on your needs, you
257may wish to increase the page size, and also adjust PAIRMAX
258(the maximum size of a key/value pair allowed: should always
259
260
261
262
263
264
265
266
267
268 - 5 -
269
270
271be at least three words smaller than PBLKSIZ.) accordingly.
272The system-wide version of the library should probably be
273configured with 1024 (distribution default), as this appears
274to be sufficient for most common uses of sdbm.
275
276Portability
277
278 This package has been tested in many different UN*Xes
279even including minix, and appears to be reasonably portable.
280This does not mean it will port easily to non-UN*X systems.
281
282Notes and Miscellaneous
283
284 The sdbm is not a very complicated package, at least
285not after you familiarize yourself with the literature on
286external hashing. There are other interesting algorithms in
287existence that ensure (approximately) single-read access to
288a data value associated with any key. These are directory-
289less schemes such as linear hashing [Lit80] (+ Larson varia-
290tions), spiral storage [Mar79] or directory schemes such as
291extensible hashing [Fag79] by Fagin et al. I do hope these
292sources provide a reasonable playground for experimentation
293with other algorithms. See the June 1988 issue of ACM Com-
294putting Surveys [Enb88] for an excellent overview of the
295field.
296
297References
298
299
300[Lar78]
301 P.-A. Larson, ``Dynamic Hashing'', BIT, vol. 18, pp.
302 184-201, 1978.
303
304[Tho90]
305 Ken Thompson, private communication, Nov. 1990
306
307[Lit80]
308 W. Litwin, `` Linear Hashing: A new tool for file and
309 table addressing'', Proceedings of the 6th Conference on
310 Very Large Dabatases (Montreal), pp. 212-223, Very
311 Large Database Foundation, Saratoga, Calif., 1980.
312
313[Fag79]
314 R. Fagin, J. Nievergelt, N. Pippinger, and H. R.
315 Strong, ``Extendible Hashing - A Fast Access Method for
316 Dynamic Files'', ACM Trans. Database Syst., vol. 4,
317 no.3, pp. 315-344, Sept. 1979.
318
319[Wal84]
320 Rich Wales, ``Discussion of "dbm" data base system'',
321 USENET newsgroup unix.wizards, Jan. 1984.
322
323[Tor87]
324 Chris Torek, ``Re: dbm.a and ndbm.a archives'',
325
326
327
328
329
330
331
332
333
334 - 6 -
335
336
337 USENET newsgroup comp.unix, 1987.
338
339[Mar79]
340 G. N. Martin, ``Spiral Storage: Incrementally Augment-
341 able Hash Addressed Storage'', Technical Report #27,
342 University of Varwick, Coventry, U.K., 1979.
343
344[Enb88]
345 R. J. Enbody and H. C. Du, ``Dynamic Hashing
346 Schemes'',ACM Computting Surveys, vol. 20, no. 2, pp.
347 85-113, June 1988.
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397