xref: /original-bsd/lib/libcompat/regexp/regexp.3 (revision 817cfbae)
1.\" Copyright 1991 The Regents of the University of California.
2.\" All rights reserved.
3.\"
4.\" %sccs.include.redist.man%
5.\"
6.\"     @(#)regexp.3	5.2 (Berkeley) 04/20/91
7.\"
8.Dd
9.Dt REGEXP 3
10.Os
11.Sh NAME
12.Nm regcomp ,
13.Nm regexec ,
14.Nm regsub ,
15.Nm regerror
16.Nd regular expression handlers
17.Sh SYNOPSIS
18.Fd #include <regexp.h>
19.Ft regexp *
20.Fn regcomp "const char *exp"
21.Ft int
22.Fn regexec "const regexp *prog" "const char *string"
23.Ft void
24.Fn regsub "const regexp *prog" "const char *source" "char *dest"
25.Sh DESCRIPTION
26The
27.Fn regcomp ,
28.Fn regexec ,
29.Fn regsub ,
30and
31.Fn regerror
32functions
33implement
34.Xr egrep 1 Ns -style
35regular expressions and supporting facilities.
36.Pp
37The
38.Fn regcomp
39function
40compiles a regular expression into a structure of type
41.Xr regexp ,
42and returns a pointer to it.
43The space has been allocated using
44.Xr malloc 3
45and may be released by
46.Xr free .
47.Pp
48The
49.Fn regexec
50function
51matches a
52.Dv NUL Ns -terminated
53.Fa string
54against the compiled regular expression
55in
56.Fa prog .
57It returns 1 for success and 0 for failure, and adjusts the contents of
58.Fa prog Ns 's
59.Em startp
60and
61.Em endp
62(see below) accordingly.
63.Pp
64The members of a
65.Xr regexp
66structure include at least the following (not necessarily in order):
67.Bd -literal -offset indent
68char *startp[NSUBEXP];
69char *endp[NSUBEXP];
70.Ed
71.Pp
72where
73.Dv NSUBEXP
74is defined (as 10) in the header file.
75Once a successful
76.Fn regexec
77has been done using the
78.Fn regexp ,
79each
80.Em startp Ns - Em endp
81pair describes one substring
82within the
83.Fa string ,
84with the
85.Em startp
86pointing to the first character of the substring and
87the
88.Em endp
89pointing to the first character following the substring.
90The 0th substring is the substring of
91.Fa string
92that matched the whole
93regular expression.
94The others are those substrings that matched parenthesized expressions
95within the regular expression, with parenthesized expressions numbered
96in left-to-right order of their opening parentheses.
97.Pp
98The
99.Fn regsub
100function
101copies
102.Fa source
103to
104.Fa dest ,
105making substitutions according to the
106most recent
107.Fn regexec
108performed using
109.Fa prog .
110Each instance of `&' in
111.Fa source
112is replaced by the substring
113indicated by
114.Em startp Ns Bq
115and
116.Em endp Ns Bq .
117Each instance of
118.Sq \e Ns Em n ,
119where
120.Em n
121is a digit, is replaced by
122the substring indicated by
123.Em startp Ns Bq Em n
124and
125.Em endp Ns Bq Em n .
126To get a literal `&' or
127.Sq \e Ns Em n
128into
129.Fa dest ,
130prefix it with `\e';
131to get a literal `\e' preceding `&' or
132.Sq \e Ns Em n ,
133prefix it with
134another `\e'.
135.Pp
136The
137.Fn regerror
138function
139is called whenever an error is detected in
140.Fn regcomp ,
141.Fn regexec ,
142or
143.Fn regsub .
144The default
145.Fn regerror
146writes the string
147.Fa msg ,
148with a suitable indicator of origin,
149on the standard
150error output
151and invokes
152.Xr exit 2 .
153The
154.Fn regerror
155function
156can be replaced by the user if other actions are desirable.
157.Sh REGULAR EXPRESSION SYNTAX
158A regular expression is zero or more
159.Em branches ,
160separated by `|'.
161It matches anything that matches one of the branches.
162.Pp
163A branch is zero or more
164.Em pieces ,
165concatenated.
166It matches a match for the first, followed by a match for the second, etc.
167.Pp
168A piece is an
169.Em atom
170possibly followed by `*', `+', or `?'.
171An atom followed by `*' matches a sequence of 0 or more matches of the atom.
172An atom followed by `+' matches a sequence of 1 or more matches of the atom.
173An atom followed by `?' matches a match of the atom, or the null string.
174.Pp
175An atom is a regular expression in parentheses (matching a match for the
176regular expression), a
177.Em range
178(see below), `.'
179(matching any single character), `^' (matching the null string at the
180beginning of the input string), `$' (matching the null string at the
181end of the input string), a `\e' followed by a single character (matching
182that character), or a single character with no other significance
183(matching that character).
184.Pp
185A
186.Em range
187is a sequence of characters enclosed in `[]'.
188It normally matches any single character from the sequence.
189If the sequence begins with `^',
190it matches any single character
191.Em not
192from the rest of the sequence.
193If two characters in the sequence are separated by `\-', this is shorthand
194for the full list of
195.Tn ASCII
196characters between them
197(e.g. `[0-9]' matches any decimal digit).
198To include a literal `]' in the sequence, make it the first character
199(following a possible `^').
200To include a literal `\-', make it the first or last character.
201.Sh AMBIGUITY
202If a regular expression could match two different parts of the input string,
203it will match the one which begins earliest.
204If both begin in the same place but match different lengths, or match
205the same length in different ways, life gets messier, as follows.
206.Pp
207In general, the possibilities in a list of branches are considered in
208left-to-right order, the possibilities for `*', `+', and `?' are
209considered longest-first, nested constructs are considered from the
210outermost in, and concatenated constructs are considered leftmost-first.
211The match that will be chosen is the one that uses the earliest
212possibility in the first choice that has to be made.
213If there is more than one choice, the next will be made in the same manner
214(earliest possibility) subject to the decision on the first choice.
215And so forth.
216.Pp
217For example,
218.Sq Li (ab|a)b*c
219could match
220`abc' in one of two ways.
221The first choice is between `ab' and `a'; since `ab' is earlier, and does
222lead to a successful overall match, it is chosen.
223Since the `b' is already spoken for,
224the `b*' must match its last possibility\(emthe empty string\(emsince
225it must respect the earlier choice.
226.Pp
227In the particular case where no `|'s are present and there is only one
228`*', `+', or `?', the net effect is that the longest possible
229match will be chosen.
230So
231.Sq Li ab* ,
232presented with `xabbbby', will match `abbbb'.
233Note that if
234.Sq Li ab* ,
235is tried against `xabyabbbz', it
236will match `ab' just after `x', due to the begins-earliest rule.
237(In effect, the decision on where to start the match is the first choice
238to be made, hence subsequent choices must respect it even if this leads them
239to less-preferred alternatives.)
240.Sh RETURN VALUES
241The
242.Fn regcomp
243function
244returns
245.Dv NULL
246for a failure
247.Pf ( Fn regerror
248permitting),
249where failures are syntax errors, exceeding implementation limits,
250or applying `+' or `*' to a possibly-null operand.
251.Sh SEE ALSO
252.Xr ed 1 ,
253.Xr ex 1 ,
254.Xr expr 1 ,
255.Xr egrep 1 ,
256.Xr fgrep 1 ,
257.Xr grep 1 ,
258.Xr regex 3
259.Sh HISTORY
260Both code and manual page for
261.Fn regcomp ,
262.Fn regexec ,
263.Fn regsub ,
264and
265.Fn regerror
266were written at the University of Toronto
267and appeared in
268.Bx 4.3 tahoe .
269They are intended to be compatible with the Bell V8
270.Xr regexp 3 ,
271but are not derived from Bell code.
272.Sh BUGS
273Empty branches and empty regular expressions are not portable to V8.
274.Pp
275The restriction against
276applying `*' or `+' to a possibly-null operand is an artifact of the
277simplistic implementation.
278.Pp
279Does not support
280.Xr egrep Ns 's
281newline-separated branches;
282neither does the V8
283.Xr regexp 3 ,
284though.
285.Pp
286Due to emphasis on
287compactness and simplicity,
288it's not strikingly fast.
289It does give special attention to handling simple cases quickly.
290