.\" Copyright 1991 The Regents of the University of California.
.\" All rights reserved.
.\"
.\" %sccs.include.redist.man%
.\"
.\"     @(#)regexp.3	5.3 (Berkeley) 08/05/92
.\"
.Dd
.Dt REGEXP 3
.Os
.Sh NAME
.Nm regcomp ,
.Nm regexec ,
.Nm regsub ,
.Nm regerror
.Nd regular expression handlers
.Sh SYNOPSIS
.Fd #include <regexp.h>
.Ft regexp *
.Fn regcomp "const char *exp"
.Ft int
.Fn regexec "const regexp *prog" "const char *string"
.Ft void
.Fn regsub "const regexp *prog" "const char *source" "char *dest"
.Ft void
.Fn regerror "const char *msg"
.Sh DESCRIPTION
This interface is made obsolete by
.Xr regex 3 .
.Pp
The
.Fn regcomp ,
.Fn regexec ,
.Fn regsub ,
and
.Fn regerror
functions
implement
.Xr egrep 1 Ns -style
regular expressions and supporting facilities.
.Pp
The
.Fn regcomp
function
compiles a regular expression into a structure of type
.Vt regexp ,
and returns a pointer to it.
The space has been allocated using
.Xr malloc 3
and may be released by
.Xr free 3 .
.Pp
The
.Fn regexec
function
matches a
.Dv NUL Ns -terminated
.Fa string
against the compiled regular expression
in
.Fa prog .
It returns 1 for success and 0 for failure, and adjusts the contents of
.Fa prog Ns 's
.Em startp
and
.Em endp
(see below) accordingly.
.Pp
The members of a
.Vt regexp
structure include at least the following (not necessarily in order):
.Bd -literal -offset indent
char *startp[NSUBEXP];
char *endp[NSUBEXP];
.Ed
.Pp
where
.Dv NSUBEXP
is defined (as 10) in the header file.
Once a successful
.Fn regexec
has been done using the
.Vt regexp ,
each
.Em startp Ns - Em endp
pair describes one substring
within the
.Fa string ,
with the
.Em startp
pointing to the first character of the substring and
the
.Em endp
pointing to the first character following the substring.
The 0th substring is the substring of
.Fa string
that matched the whole
regular expression.
The others are those substrings that matched parenthesized expressions
within the regular expression, with parenthesized expressions numbered
in left-to-right order of their opening parentheses.
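.Pp
The pointer arithmetic described above can be sketched in C as follows.
This is a minimal illustration, not the library's code:
the names demo_regexp and demo_substring are hypothetical stand-ins
for the two documented members,
the caller-supplied buffer size is an added safety measure,
and treating an unset pair as NULL is an assumption that this page does not state.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NSUBEXP 10

/* Hypothetical stand-in holding only the two documented members. */
struct demo_regexp {
    const char *startp[NSUBEXP];
    const char *endp[NSUBEXP];
};

/*
 * Copy substring n into buf as a NUL-terminated string.
 * Returns the substring length, or -1 if n is out of range, the pair
 * is unset (assumed to be NULL here), or buf is too small.
 */
static long demo_substring(const struct demo_regexp *prog, int n,
                           char *buf, size_t bufsz)
{
    size_t len;

    assert(prog != NULL && buf != NULL);
    if (n < 0 || n >= NSUBEXP)
        return -1;
    if (prog->startp[n] == NULL || prog->endp[n] == NULL)
        return -1;

    /* endp points one past the last character, so the difference is the length. */
    len = (size_t)(prog->endp[n] - prog->startp[n]);
    if (len >= bufsz)
        return -1;
    memcpy(buf, prog->startp[n], len);
    buf[len] = '\0';
    return (long)len;
}
```

For instance, if an expression with two parenthesized parts had matched all of
the string key=value, with the parts covering key and value,
substring 2 would have length endp[2] - startp[2] = 5.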
.Pp
The
.Fn regsub
function
copies
.Fa source
to
.Fa dest ,
making substitutions according to the
most recent
.Fn regexec
performed using
.Fa prog .
Each instance of `&' in
.Fa source
is replaced by the substring
indicated by
.Em startp Ns Bq 0
and
.Em endp Ns Bq 0 .
Each instance of
.Sq \e Ns Em n ,
where
.Em n
is a digit, is replaced by
the substring indicated by
.Em startp Ns Bq Em n
and
.Em endp Ns Bq Em n .
To get a literal `&' or
.Sq \e Ns Em n
into
.Fa dest ,
prefix it with `\e';
to get a literal `\e' preceding `&' or
.Sq \e Ns Em n ,
prefix it with
another `\e'.
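.Pp
The substitution rules above can be sketched as a small stand-alone routine.
This is an illustration written from the rules on this page, not the library's implementation:
demo_regexp and demo_regsub are hypothetical names,
the destsz bound does not exist in the real interface (which trusts the caller to provide enough room),
and skipping an unset substring pair (assumed NULL) is likewise an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NSUBEXP 10

/* Hypothetical stand-in holding only the two documented members. */
struct demo_regexp {
    const char *startp[NSUBEXP];
    const char *endp[NSUBEXP];
};

/* Returns 0 on success, -1 if the result would not fit in dest. */
static int demo_regsub(const struct demo_regexp *prog, const char *source,
                       char *dest, size_t destsz)
{
    size_t out = 0;
    char c;

    assert(prog != NULL && source != NULL && dest != NULL);
    while ((c = *source++) != '\0') {
        int no = -1;                /* -1: copy c as an ordinary character */

        if (c == '&') {
            no = 0;                 /* whole match */
        } else if (c == '\\' && *source >= '0' && *source <= '9') {
            no = *source++ - '0';   /* backslash-digit: substring n */
        } else if (c == '\\' && (*source == '\\' || *source == '&')) {
            c = *source++;          /* escaped character, copied literally */
        }

        if (no < 0) {
            if (out + 1 >= destsz)
                return -1;
            dest[out++] = c;
        } else if (prog->startp[no] != NULL && prog->endp[no] != NULL) {
            size_t len = (size_t)(prog->endp[no] - prog->startp[no]);

            if (out + len >= destsz)
                return -1;
            memcpy(dest + out, prog->startp[no], len);
            out += len;
        }
    }
    if (out >= destsz)
        return -1;
    dest[out] = '\0';
    return 0;
}
```

With substring 0 set to abc and substring 1 set to b, the source text
[&|\1|\&] would be copied to the destination as [abc|b|&].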
.Pp
The
.Fn regerror
function
is called whenever an error is detected in
.Fn regcomp ,
.Fn regexec ,
or
.Fn regsub .
The default
.Fn regerror
writes the string
.Fa msg ,
with a suitable indicator of origin,
on the standard
error output
and invokes
.Xr exit 3 .
The
.Fn regerror
function
can be replaced by the user if other actions are desirable.
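.Pp
A user-supplied replacement might look like the sketch below,
which records the message and returns instead of exiting.
It assumes the prototype void regerror(const char *msg);
the buffer demo_last_error and the message prefix are hypothetical choices made for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical holder for the most recent message. */
static char demo_last_error[128];

/*
 * Replacement for the default handler: remember the message, report it
 * on standard error, and return to the caller rather than exiting.
 */
void regerror(const char *msg)
{
    assert(msg != NULL);
    snprintf(demo_last_error, sizeof demo_last_error, "%s", msg);
    fprintf(stderr, "regexp: %s\n", msg);
}
```

Because this version returns, the caller would then see the failure through the
return value of the function that detected the error.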
.Sh REGULAR EXPRESSION SYNTAX
A regular expression is zero or more
.Em branches ,
separated by `|'.
It matches anything that matches one of the branches.
.Pp
A branch is zero or more
.Em pieces ,
concatenated.
It matches a match for the first, followed by a match for the second, etc.
.Pp
A piece is an
.Em atom
possibly followed by `*', `+', or `?'.
An atom followed by `*' matches a sequence of 0 or more matches of the atom.
An atom followed by `+' matches a sequence of 1 or more matches of the atom.
An atom followed by `?' matches a match of the atom, or the null string.
.Pp
An atom is a regular expression in parentheses (matching a match for the
regular expression), a
.Em range
(see below), `.'
(matching any single character), `^' (matching the null string at the
beginning of the input string), `$' (matching the null string at the
end of the input string), a `\e' followed by a single character (matching
that character), or a single character with no other significance
(matching that character).
.Pp
A
.Em range
is a sequence of characters enclosed in `[]'.
It normally matches any single character from the sequence.
If the sequence begins with `^',
it matches any single character
.Em not
from the rest of the sequence.
If two characters in the sequence are separated by `\-', this is shorthand
for the full list of
.Tn ASCII
characters between them
(e.g. `[0-9]' matches any decimal digit).
To include a literal `]' in the sequence, make it the first character
(following a possible `^').
To include a literal `\-', make it the first or last character.
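.Pp
A few of these syntax rules can be tried out with the sketch below.
Since the header for this interface is rarely installed on current systems,
the sketch uses the POSIX regex(3) routines that supersede it;
they share the names regcomp and regexec but have different signatures,
and only constructs that behave the same way in POSIX extended syntax are exercised.
The helper demo_matches is a hypothetical name.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Return 1 if pattern matches somewhere in input, 0 if it does not,
 * and -1 if the pattern fails to compile.
 * Uses POSIX regex(3), not the interface documented on this page.
 */
static int demo_matches(const char *pattern, const char *input)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && input != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, input, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

For example, the pattern ^[]a]$ accepts a lone right bracket because the bracket
comes first in the range, and ^[a-]$ accepts a lone hyphen because the hyphen comes last.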
.Sh AMBIGUITY
If a regular expression could match two different parts of the input string,
it will match the one which begins earliest.
If both begin in the same place but match different lengths, or match
the same length in different ways, life gets messier, as follows.
.Pp
In general, the possibilities in a list of branches are considered in
left-to-right order, the possibilities for `*', `+', and `?' are
considered longest-first, nested constructs are considered from the
outermost in, and concatenated constructs are considered leftmost-first.
The match that will be chosen is the one that uses the earliest
possibility in the first choice that has to be made.
If there is more than one choice, the next will be made in the same manner
(earliest possibility) subject to the decision on the first choice.
And so forth.
.Pp
For example,
.Sq Li (ab|a)b*c
could match
`abc' in one of two ways.
The first choice is between `ab' and `a'; since `ab' is earlier, and does
lead to a successful overall match, it is chosen.
Since the `b' is already spoken for,
the `b*' must match its last possibility\(emthe empty string\(emsince
it must respect the earlier choice.
.Pp
In the particular case where no `|'s are present and there is only one
`*', `+', or `?', the net effect is that the longest possible
match will be chosen.
So
.Sq Li ab* ,
presented with `xabbbby', will match `abbbb'.
Note that if
.Sq Li ab*
is tried against `xabyabbbz', it
will match `ab' just after `x', due to the begins-earliest rule.
(In effect, the decision on where to start the match is the first choice
to be made, hence subsequent choices must respect it even if this leads them
to less-preferred alternatives.)
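.Pp
The two single-operator cases above can be checked with the sketch below.
It relies on the POSIX regex(3) routines rather than this interface,
on the assumption that the header for this interface is unavailable;
POSIX resolves ambiguity by its own leftmost-longest rule,
which gives the same answer for these two inputs but is not guaranteed to agree in the general case.
The helper demo_first_match is a hypothetical name.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Find the first match of pattern in input using POSIX regex(3).
 * On success, stores the match offset and length and returns 0.
 * Returns -1 if the pattern does not compile or nothing matches.
 */
static int demo_first_match(const char *pattern, const char *input,
                            size_t *start, size_t *len)
{
    regex_t re;
    regmatch_t m[1];
    int rc;

    assert(pattern != NULL && input != NULL);
    assert(start != NULL && len != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    rc = regexec(&re, input, 1, m, 0);
    regfree(&re);
    if (rc != 0)
        return -1;

    /* rm_so and rm_eo play the role of startp[0] and endp[0], as offsets. */
    *start = (size_t)m[0].rm_so;
    *len = (size_t)(m[0].rm_eo - m[0].rm_so);
    return 0;
}
```

Against xabbbby the match starts at offset 1 with length 5 (abbbb);
against xabyabbbz it starts at offset 1 with length 2 (ab),
even though a longer match exists further along.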
.Sh RETURN VALUES
The
.Fn regcomp
function
returns
.Dv NULL
for a failure
.Pf ( Fn regerror
permitting),
where failures are syntax errors, exceeding implementation limits,
or applying `+' or `*' to a possibly-null operand.
.Sh SEE ALSO
.Xr ed 1 ,
.Xr ex 1 ,
.Xr expr 1 ,
.Xr egrep 1 ,
.Xr fgrep 1 ,
.Xr grep 1 ,
.Xr regex 3
.Sh HISTORY
Both code and manual page for
.Fn regcomp ,
.Fn regexec ,
.Fn regsub ,
and
.Fn regerror
were written at the University of Toronto
and appeared in
.Bx 4.3 tahoe .
They are intended to be compatible with the Bell V8
.Xr regexp 3 ,
but are not derived from Bell code.
.Sh BUGS
Empty branches and empty regular expressions are not portable to V8.
.Pp
The restriction against
applying `*' or `+' to a possibly-null operand is an artifact of the
simplistic implementation.
.Pp
Does not support
.Xr egrep 1 Ns 's
newline-separated branches;
neither does the V8
.Xr regexp 3 ,
though.
.Pp
Due to emphasis on
compactness and simplicity,
it's not strikingly fast.
It does give special attention to handling simple cases quickly.
293