xref: /netbsd/lib/libcompat/regexp/regexp.3 (revision 6550d01e)
1.\" Copyright (c) 1991, 1993
2.\"	The Regents of the University of California.  All rights reserved.
3.\"
4.\" Redistribution and use in source and binary forms, with or without
5.\" modification, are permitted provided that the following conditions
6.\" are met:
7.\" 1. Redistributions of source code must retain the above copyright
8.\"    notice, this list of conditions and the following disclaimer.
9.\" 2. Redistributions in binary form must reproduce the above copyright
10.\"    notice, this list of conditions and the following disclaimer in the
11.\"    documentation and/or other materials provided with the distribution.
12.\" 3. Neither the name of the University nor the names of its contributors
13.\"    may be used to endorse or promote products derived from this software
14.\"    without specific prior written permission.
15.\"
16.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
17.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
18.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
19.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
20.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
21.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
22.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
23.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
24.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
25.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
26.\" SUCH DAMAGE.
27.\"
28.\"     from: @(#)regexp.3	8.1 (Berkeley) 6/4/93
29.\"	$NetBSD: regexp.3,v 1.16 2008/09/08 22:14:11 apb Exp $
30.\"
31.Dd June 4, 1993
32.Dt REGEXP 3
33.Os
34.Sh NAME
35.Nm regcomp ,
36.Nm regexec ,
37.Nm regsub ,
38.Nm regerror
39.Nd obsolete "'regexp'" regular expression handlers
40.Sh LIBRARY
41.Lb libcompat
42.Sh SYNOPSIS
43.In regexp.h
44.Ft regexp *
45.Fn regcomp "const char *exp"
46.Ft int
47.Fn regexec "const regexp *prog" "const char *string"
48.Ft void
49.Fn regsub "const regexp *prog" "const char *source" "char *dest"
50.Ft void
51.Fn regerror "const char *msg"
52.Sh DESCRIPTION
53.Bf -symbolic
54This interface is made obsolete by
55.Xr regex 3 .
56It is available from the compatibility library, libcompat.
57.Ef
58.Pp
59The
60.Fn regcomp ,
61.Fn regexec ,
62.Fn regsub ,
63and
64.Fn regerror
65functions implement
66.Xr egrep 1 Ns -style
67regular expressions and supporting facilities.
68.Pp
69The
70.Fn regcomp
71function
72compiles a regular expression into a structure of type
73.Em regexp ,
74and returns a pointer to it.
75The space has been allocated using
76.Xr malloc 3
77and may be released by
78.Xr free 3 .
79.Pp
80The
81.Fn regexec
82function
83matches a
84.Dv NUL Ns -terminated
85.Fa string
86against the compiled regular expression
87in
88.Fa prog .
89It returns 1 for success and 0 for failure, and adjusts the contents of
90.Fa prog Ns 's
91.Em startp
92and
93.Em endp
94(see below) accordingly.
95.Pp
96The members of a
97.Em regexp
98structure include at least the following (not necessarily in order):
99.Bd -literal -offset indent
100char *startp[NSUBEXP];
101char *endp[NSUBEXP];
102.Ed
103.Pp
104where
105.Dv NSUBEXP
106is defined (as 10) in the header file.
107Once a successful
108.Fn regexec
109has been done using the
110.Fn regexp ,
111each
112.Em startp Ns - Em endp
113pair describes one substring
114within the
115.Fa string ,
116with the
117.Em startp
118pointing to the first character of the substring and
119the
120.Em endp
121pointing to the first character following the substring.
122The 0th substring is the substring of
123.Fa string
124that matched the whole
125regular expression.
126The others are those substrings that matched parenthesized expressions
127within the regular expression, with parenthesized expressions numbered
128in left-to-right order of their opening parentheses.
129.Pp
130The
131.Fn regsub
132function
133copies
134.Fa source
135to
136.Fa dest ,
137making substitutions according to the
138most recent
139.Fn regexec
140performed using
141.Fa prog .
142Each instance of `\*[Am]' in
143.Fa source
144is replaced by the substring
145indicated by
146.Em startp Ns Bq
147and
148.Em endp Ns Bq .
149Each instance of
150.Sq \e Ns Em n ,
151where
152.Em n
153is a digit, is replaced by
154the substring indicated by
155.Em startp Ns Bq Em n
156and
157.Em endp Ns Bq Em n .
158To get a literal `\*[Am]' or
159.Sq \e Ns Em n
160into
161.Fa dest ,
162prefix it with `\e';
163to get a literal `\e' preceding `\*[Am]' or
164.Sq \e Ns Em n ,
165prefix it with
166another `\e'.
167.Pp
168The
169.Fn regerror
170function
171is called whenever an error is detected in
172.Fn regcomp ,
173.Fn regexec ,
174or
175.Fn regsub .
176The default
177.Fn regerror
178writes the string
179.Fa msg ,
180with a suitable indicator of origin,
181on the standard
182error output
183and invokes
184.Xr exit 3 .
185The
186.Fn regerror
187function
188can be replaced by the user if other actions are desirable.
189.Sh REGULAR EXPRESSION SYNTAX
190A regular expression is zero or more
191.Em branches ,
192separated by `|'.
193It matches anything that matches one of the branches.
194.Pp
195A branch is zero or more
196.Em pieces ,
197concatenated.
198It matches a match for the first, followed by a match for the second, etc.
199.Pp
200A piece is an
201.Em atom
202possibly followed by `*', `+', or `?'.
203An atom followed by `*' matches a sequence of 0 or more matches of the atom.
204An atom followed by `+' matches a sequence of 1 or more matches of the atom.
205An atom followed by `?' matches a match of the atom, or the null string.
206.Pp
207An atom is a regular expression in parentheses (matching a match for the
208regular expression), a
209.Em range
210(see below), `.'
211(matching any single character), `^' (matching the null string at the
212beginning of the input string), `$' (matching the null string at the
213end of the input string), a `\e' followed by a single character (matching
214that character), or a single character with no other significance
215(matching that character).
216.Pp
217A
218.Em range
219is a sequence of characters enclosed in `[]'.
220It normally matches any single character from the sequence.
221If the sequence begins with `^',
222it matches any single character
223.Em not
224from the rest of the sequence.
225If two characters in the sequence are separated by `\-', this is shorthand
226for the full list of
227.Tn ASCII
228characters between them
229(e.g. `[0-9]' matches any decimal digit).
230To include a literal `]' in the sequence, make it the first character
231(following a possible `^').
232To include a literal `\-', make it the first or last character.
233.Sh AMBIGUITY
234If a regular expression could match two different parts of the input string,
235it will match the one which begins earliest.
236If both begin in the same place but match different lengths, or match
237the same length in different ways, life gets messier, as follows.
238.Pp
239In general, the possibilities in a list of branches are considered in
240left-to-right order, the possibilities for `*', `+', and `?' are
241considered longest-first, nested constructs are considered from the
242outermost in, and concatenated constructs are considered leftmost-first.
243The match that will be chosen is the one that uses the earliest
244possibility in the first choice that has to be made.
245If there is more than one choice, the next will be made in the same manner
246(earliest possibility) subject to the decision on the first choice.
247And so forth.
248.Pp
249For example,
250.Sq Li (ab|a)b*c
251could match
252`abc' in one of two ways.
253The first choice is between `ab' and `a'; since `ab' is earlier, and does
254lead to a successful overall match, it is chosen.
255Since the `b' is already spoken for,
256the `b*' must match its last possibility\(emthe empty string\(emsince
257it must respect the earlier choice.
258.Pp
259In the particular case where no `|'s are present and there is only one
260`*', `+', or `?', the net effect is that the longest possible
261match will be chosen.
262So
263.Sq Li ab* ,
264presented with `xabbbby', will match `abbbb'.
265Note that if
266.Sq Li ab* ,
267is tried against `xabyabbbz', it
268will match `ab' just after `x', due to the begins-earliest rule.
269(In effect, the decision on where to start the match is the first choice
270to be made, hence subsequent choices must respect it even if this leads them
271to less-preferred alternatives.)
272.Sh RETURN VALUES
273The
274.Fn regcomp
275function
276returns
277.Dv NULL
278for a failure
279.Pf ( Fn regerror
280permitting),
281where failures are syntax errors, exceeding implementation limits,
282or applying `+' or `*' to a possibly-null operand.
283.Sh SEE ALSO
284.Xr ed 1 ,
285.Xr egrep 1 ,
286.Xr ex 1 ,
287.Xr expr 1 ,
288.Xr fgrep 1 ,
289.Xr grep 1 ,
290.Xr regex 3
291.Sh HISTORY
292Both code and manual page for
293.Fn regcomp ,
294.Fn regexec ,
295.Fn regsub ,
296and
297.Fn regerror
298were written at the University of Toronto
299and appeared in
300.Bx 4.3 tahoe .
301They are intended to be compatible with the Bell V8
302.Xr regexp 3 ,
303but are not derived from Bell code.
304.Sh BUGS
305Empty branches and empty regular expressions are not portable to V8.
306.Pp
307The restriction against
308applying `*' or `+' to a possibly-null operand is an artifact of the
309simplistic implementation.
310.Pp
311Does not support
312.Xr egrep 1 Ns 's
313newline-separated branches;
314neither does the V8
315.Xr regexp 3 ,
316though.
317.Pp
318Due to emphasis on
319compactness and simplicity,
320it's not strikingly fast.
321It does give special attention to handling simple cases quickly.
322