xref: /netbsd/lib/libc/regex/regex.3 (revision c4a72b64)
1.\"	$NetBSD: regex.3,v 1.14 2002/10/01 17:06:53 wiz Exp $
2.\"
3.\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
4.\" Copyright (c) 1992, 1993, 1994
5.\"	The Regents of the University of California.  All rights reserved.
6.\"
7.\" This code is derived from software contributed to Berkeley by
8.\" Henry Spencer.
9.\"
10.\" Redistribution and use in source and binary forms, with or without
11.\" modification, are permitted provided that the following conditions
12.\" are met:
13.\" 1. Redistributions of source code must retain the above copyright
14.\"    notice, this list of conditions and the following disclaimer.
15.\" 2. Redistributions in binary form must reproduce the above copyright
16.\"    notice, this list of conditions and the following disclaimer in the
17.\"    documentation and/or other materials provided with the distribution.
18.\" 3. All advertising materials mentioning features or use of this software
19.\"    must display the following acknowledgement:
20.\"	This product includes software developed by the University of
21.\"	California, Berkeley and its contributors.
22.\" 4. Neither the name of the University nor the names of its contributors
23.\"    may be used to endorse or promote products derived from this software
24.\"    without specific prior written permission.
25.\"
26.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
27.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
28.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
29.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
30.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
31.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
32.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
33.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
34.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
35.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
36.\" SUCH DAMAGE.
37.\"
38.\"	@(#)regex.3	8.4 (Berkeley) 3/20/94
39.\"
40.Dd March 20, 1994
41.Dt REGEX 3
42.Os
43.Sh NAME
44.Nm regex ,
45.Nm regcomp ,
46.Nm regexec ,
47.Nm regerror ,
48.Nm regfree
49.Nd regular-expression library
50.Sh LIBRARY
51.Lb libc
52.Sh SYNOPSIS
53.Fd #include \*[Lt]regex.h\*[Gt]
54.Ft int
55.Fn regcomp "regex_t *preg" "const char *pattern" "int cflags"
56.Ft int
57.Fn regexec "const regex_t *preg" "const char *string" "size_t nmatch" "regmatch_t pmatch[]" "int eflags"
58.Ft size_t
59.Fn regerror "int errcode" "const regex_t *preg" "char *errbuf" "size_t errbuf_size"
60.Ft void
61.Fn regfree "regex_t *preg"
62.Sh DESCRIPTION
63These routines implement
64.St -p1003.2-92
65regular expressions (``RE''s);
66see
67.Xr re_format 7 .
68.Fn regcomp
69compiles an RE written as a string into an internal form,
70.Fn regexec
71matches that internal form against a string and reports results,
72.Fn regerror
73transforms error codes from either into human-readable messages,
74and
75.Fn regfree
76frees any dynamically-allocated storage used by the internal form
77of an RE.
78.Pp
79The header
80.Em \*[Lt]regex.h\*[Gt]
81declares two structure types,
82.Fa regex_t
83and
84.Fa regmatch_t ,
85the former for compiled internal forms and the latter for match reporting.
86It also declares the four functions,
87a type
88.Fa regoff_t ,
89and a number of constants with names starting with ``REG_''.
90.Pp
91.Fn regcomp
92compiles the regular expression contained in the
93.Fa pattern
94string,
95subject to the flags in
96.Fa cflags ,
97and places the results in the
98.Fa regex_t
99structure pointed to by
100.Fa preg .
101.Fa cflags
102is the bitwise OR of zero or more of the following flags:
103.Bl -tag -width XXXREG_EXTENDED
104.It Dv REG_EXTENDED
105Compile modern (``extended'') REs, rather than the obsolete
106(``basic'') REs that are the default.
107.It Dv REG_BASIC
108This is a synonym for 0,
109provided as a counterpart to REG_EXTENDED to improve readability.
110.It Dv REG_NOSPEC
111Compile with recognition of all special characters turned off.
112All characters are thus considered ordinary, so the ``RE'' is a literal
113string.
114This is an extension, compatible with but not specified by
115.St -p1003.2-92 ,
116and should be used with caution in software intended to be portable to
117other systems.
118.Dv REG_EXTENDED
119and
120.Dv REG_NOSPEC
121may not be used in the same call to
122.Fn regcomp .
123.It Dv REG_ICASE
124Compile for matching that ignores upper/lower case distinctions.
125See
126.Xr re_format 7 .
127.It Dv REG_NOSUB
128Compile for matching that need only report success or failure, not
129what was matched.
130.It Dv REG_NEWLINE
131Compile for newline-sensitive matching.
132By default, newline is a completely ordinary character with no special
133meaning in either REs or strings.
134With this flag,
135`[^' bracket expressions and `.' never match newline,
136a `^' anchor matches the null string after any newline in the string
137in addition to its normal function,
138and the `$' anchor matches the null string before any newline in the
139string in addition to its normal function.
140.It Dv REG_PEND
141The regular expression ends, not at the first NUL, but just before the
142character pointed to by the
143.Fa re_endp
144member of the structure pointed to by
145.Fa preg .
146The
147.Fa re_endp
148member is of type
149.Fa "const\ char\ *" .
150This flag permits inclusion of NULs in the RE; they are considered
151ordinary characters.
152This is an extension, compatible with but not specified by
153.St -p1003.2-92 ,
154and should be used with caution in software intended to be portable to
155other systems.
156.El
157.Pp
158When successful,
159.Fn regcomp
160returns 0 and fills in the structure pointed to by
161.Fa preg .
162One member of that structure (other than
163.Fa re_endp )
164is publicized:
165.Fa re_nsub ,
166of type
167.Fa size_t ,
168contains the number of parenthesized subexpressions within the RE
169(except that the value of this member is undefined if the
170.Dv REG_NOSUB
171flag was used).
172If
173.Fn regcomp
174fails, it returns a non-zero error code;
175see
176.Sx DIAGNOSTICS .
177.Pp
178.Fn regexec
179matches the compiled RE pointed to by
180.Fa preg
181against the
182.Fa string ,
183subject to the flags in
184.Fa eflags ,
185and reports results using
186.Fa nmatch ,
187.Fa pmatch ,
188and the returned value.
189The RE must have been compiled by a previous invocation of
190.Fn regcomp .
191The compiled form is not altered during execution of
192.Fn regexec ,
193so a single compiled RE can be used simultaneously by multiple threads.
194.Pp
195By default,
196the NUL-terminated string pointed to by
197.Fa string
198is considered to be the text of an entire line, minus any terminating
199newline.
200The
201.Fa eflags
202argument is the bitwise OR of zero or more of the following flags:
203.Bl -tag -width XXXREG_NOTBOL
204.It Dv REG_NOTBOL
205The first character of the string
206is not the beginning of a line, so the `^' anchor should not match before it.
207This does not affect the behavior of newlines under
208.Dv REG_NEWLINE .
209.It Dv REG_NOTEOL
210The NUL terminating the string does not end a line, so the `$' anchor
211should not match before it.
212This does not affect the behavior of newlines under
213.Dv REG_NEWLINE .
214.It Dv REG_STARTEND
215The string is considered to start at
216.Fa string
217+
218.Fa pmatch[0].rm_so
219and to have a terminating NUL located at
220.Fa string
221+
222.Fa pmatch[0].rm_eo
223(there need not actually be a NUL at that location),
224regardless of the value of
225.Fa nmatch .
226See below for the definition of
227.Fa pmatch
228and
229.Fa nmatch .
230This is an extension, compatible with but not specified by
231.St -p1003.2-92 ,
232and should be used with caution in software intended to be portable to
233other systems.
234Note that a non-zero
235.Fa rm_so
236does not imply
237.Dv REG_NOTBOL ;
238.Dv REG_STARTEND
239affects only the location of the string, not how it is matched.
240.El
241.Pp
242See
243.Xr re_format 7
244for a discussion of what is matched in situations where an RE or a
245portion thereof could match any of several substrings of
246.Fa string .
247.Pp
248Normally,
249.Fn regexec
250returns 0 for success and the non-zero code
251.Dv REG_NOMATCH
252for failure.
253Other non-zero error codes may be returned in exceptional situations;
254see
255.Sx DIAGNOSTICS .
256.Pp
257If
258.Dv REG_NOSUB
259was specified in the compilation of the RE, or if
260.Fa nmatch
261is 0,
262.Fn regexec
263ignores the
264.Fa pmatch
265argument (but see below for the case where
266.Dv REG_STARTEND
267is specified).
268Otherwise,
269.Fa pmatch
270points to an array of
271.Fa nmatch
272structures of type
273.Fa regmatch_t .
274Such a structure has at least the members
275.Fa rm_so
276and
277.Fa rm_eo ,
278both of type
279.Fa regoff_t
280(a signed arithmetic type at least as large as an
281.Fa off_t
282and a
283.Fa ssize_t ) ,
284containing respectively the offset of the first character of a substring
285and the offset of the first character after the end of the substring.
286Offsets are measured from the beginning of the
287.Fa string
288argument given to
289.Fn regexec .
290An empty substring is denoted by equal offsets,
291both indicating the character following the empty substring.
292.Pp
293The 0th member of the
294.Fa pmatch
295array is filled in to indicate what substring of
296.Fa string
297was matched by the entire RE.
298Remaining members report what substring was matched by parenthesized
299subexpressions within the RE;
300member
301.Fa i
302reports subexpression
303.Fa i ,
304with subexpressions counted (starting at 1) by the order of their
305opening parentheses in the RE, left to right.
306Unused entries in the array\(emcorresponding either to subexpressions that
307did not participate in the match at all, or to subexpressions that do not
308exist in the RE (that is,
309.Fa i
310\*[Gt]
311.Fa preg-\*[Gt]re_nsub )
312\(emhave both
313.Fa rm_so
314and
315.Fa rm_eo
316set to -1.
317If a subexpression participated in the match several times,
318the reported substring is the last one it matched.
319(Note, as an example in particular, that when the RE `(b*)+' matches `bbb',
320the parenthesized subexpression matches each of the three `b's and then
321an infinite number of empty strings following the last `b',
322so the reported substring is one of the empties.)
323.Pp
324If
325.Dv REG_STARTEND
326is specified,
327.Fa pmatch
328must point to at least one
329.Fa regmatch_t
330(even if
331.Fa nmatch
332is 0 or
333.Dv REG_NOSUB
334was specified),
335to hold the input offsets for
336.Dv REG_STARTEND .
337Use for output is still entirely controlled by
338.Fa nmatch ;
339if
340.Fa nmatch
341is 0 or
342.Dv REG_NOSUB
343was specified,
344the value of
345.Fa pmatch [0]
346will not be changed by a successful
347.Fn regexec .
348.Pp
349.Fn regerror
350maps a non-zero
351.Fa errcode
352from either
353.Fn regcomp
354or
355.Fn regexec
356to a human-readable, printable message.
357If
358.Fa preg
359is non-NULL,
360the error code should have arisen from use of the
361.Fa regex_t
362pointed to by
363.Fa preg ,
364and if the error code came from
365.Fn regcomp ,
366it should have been the result from the most recent
367.Fn regcomp
368using that
369.Fa regex_t . (
370.Fn regerror
371may be able to supply a more detailed message using information
372from the
373.Fa regex_t . )
374.Fn regerror
375places the NUL-terminated message into the buffer pointed to by
376.Fa errbuf ,
377limiting the length (including the NUL) to at most
378.Fa errbuf_size
379bytes.
380If the whole message won't fit,
381as much of it as will fit before the terminating NUL is supplied.
382In any case,
383the returned value is the size of buffer needed to hold the whole
384message (including terminating NUL).
385If
386.Fa errbuf_size
387is 0,
388.Fa errbuf
389is ignored but the return value is still correct.
390.Pp
391If the
392.Fa errcode
393given to
394.Fn regerror
395is first ORed with
396.Dv REG_ITOA ,
397the ``message'' that results is the printable name of the error code,
398e.g. ``REG_NOMATCH'',
399rather than an explanation thereof.
400If
401.Fa errcode
402is
403.Dv REG_ATOI ,
404then
405.Fa preg
406shall be non-NULL and the
407.Fa re_endp
408member of the structure it points to
409must point to the printable name of an error code;
410in this case, the result in
411.Fa errbuf
412is the decimal digits of
413the numeric value of the error code
414(0 if the name is not recognized).
415.Dv REG_ITOA
416and
417.Dv REG_ATOI
418are intended primarily as debugging facilities;
419they are extensions, compatible with but not specified by
420.St -p1003.2-92 ,
421and should be used with caution in software intended to be portable to
422other systems.
423Be warned also that they are considered experimental and changes are possible.
424.Pp
425.Fn regfree
426frees any dynamically-allocated storage associated with the compiled RE
427pointed to by
428.Fa preg .
429The remaining
430.Fa regex_t
431is no longer a valid compiled RE
432and the effect of supplying it to
433.Fn regexec
434or
435.Fn regerror
436is undefined.
437.Pp
438None of these functions references global variables except for tables
439of constants;
440all are safe for use from multiple threads if the arguments are safe.
441.Sh IMPLEMENTATION CHOICES
442There are a number of decisions that
443.St -p1003.2-92
444leaves up to the implementor,
445either by explicitly saying ``undefined'' or by virtue of them being
446forbidden by the RE grammar.
447This implementation treats them as follows.
448.Pp
449See
450.Xr re_format 7
451for a discussion of the definition of case-independent matching.
452.Pp
453There is no particular limit on the length of REs,
454except insofar as memory is limited.
455Memory usage is approximately linear in RE size, and largely insensitive
456to RE complexity, except for bounded repetitions.
457See BUGS for one short RE using them
458that will run almost any system out of memory.
459.Pp
460A backslashed character other than one specifically given a magic meaning
461by
462.St -p1003.2-92
463(such magic meanings occur only in obsolete [``basic''] REs)
464is taken as an ordinary character.
465.Pp
466Any unmatched [ is a
467.Dv REG_EBRACK
468error.
469.Pp
470Equivalence classes cannot begin or end bracket-expression ranges.
471The endpoint of one range cannot begin another.
472.Pp
473.Dv RE_DUP_MAX ,
474the limit on repetition counts in bounded repetitions, is 255.
475.Pp
476A repetition operator (?, *, +, or bounds) cannot follow another
477repetition operator.
478A repetition operator cannot begin an expression or subexpression
479or follow `^' or `|'.
480.Pp
481`|' cannot appear first or last in a (sub)expression or after another `|',
482i.e. an operand of `|' cannot be an empty subexpression.
483An empty parenthesized subexpression, `()', is legal and matches an
484empty (sub)string.
485An empty string is not a legal RE.
486.Pp
487A `{' followed by a digit is considered the beginning of bounds for a
488bounded repetition, which must then follow the syntax for bounds.
489A `{' \fInot\fR followed by a digit is considered an ordinary character.
490.Pp
491`^' and `$' beginning and ending subexpressions in obsolete (``basic'')
492REs are anchors, not ordinary characters.
493.Sh DIAGNOSTICS
494Non-zero error codes from
495.Fn regcomp
496and
497.Fn regexec
498include the following:
499.Pp
500.Bl -tag -width XXXREG_ECOLLATE -compact
501.It Dv REG_NOMATCH
502regexec() failed to match
503.It Dv REG_BADPAT
504invalid regular expression
505.It Dv REG_ECOLLATE
506invalid collating element
507.It Dv REG_ECTYPE
508invalid character class
509.It Dv REG_EESCAPE
510\e applied to unescapable character
511.It Dv REG_ESUBREG
512invalid backreference number
513.It Dv REG_EBRACK
514brackets [ ] not balanced
515.It Dv REG_EPAREN
516parentheses ( ) not balanced
517.It Dv REG_EBRACE
518braces { } not balanced
519.It Dv REG_BADBR
520invalid repetition count(s) in { }
521.It Dv REG_ERANGE
522invalid character range in [ ]
523.It Dv REG_ESPACE
524ran out of memory
525.It Dv REG_BADRPT
526?, *, or + operand invalid
527.It Dv REG_EMPTY
528empty (sub)expression
529.It Dv REG_ASSERT
530``can't happen''\(emyou found a bug
531.It Dv REG_INVARG
532invalid argument, e.g. negative-length string
533.El
534.Sh SEE ALSO
535.Xr grep 1 ,
536.Xr sed 1 ,
537.Xr re_format 7
538.Pp
539.St -p1003.2-92 ,
540sections 2.8 (Regular Expression Notation)
541and
542B.5 (C Binding for Regular Expression Matching).
543.Sh HISTORY
544Originally written by Henry Spencer.
545Altered for inclusion in the
546.Bx 4.4
547distribution.
548.Sh BUGS
549This is an alpha release with known defects.
550Please report problems.
551.Pp
552There is one known functionality bug.
553The implementation of internationalization is incomplete:
554the locale is always assumed to be the default one of
555.St -p1003.2-92 ,
556and only the collating elements etc. of that locale are available.
557.Pp
558The back-reference code is subtle and doubts linger about its correctness
559in complex cases.
560.Pp
561.Fn regexec
562performance is poor.
563This will improve with later releases.
564.Fa nmatch
565exceeding 0 is expensive;
566.Fa nmatch
567exceeding 1 is worse.
568.Fa regexec
569is largely insensitive to RE complexity
570.Em except
571that back references are massively expensive.
572RE length does matter; in particular, there is a strong speed bonus
573for keeping RE length under about 30 characters,
574with most special characters counting roughly double.
575.Pp
576.Fn regcomp
577implements bounded repetitions by macro expansion,
578which is costly in time and space if counts are large
579or bounded repetitions are nested.
580An RE like, say,
581`((((a{1,100}){1,100}){1,100}){1,100}){1,100}'
582will (eventually) run almost any existing machine out of swap space.
583.Pp
584There are suspected problems with response to obscure error conditions.
585Notably,
586certain kinds of internal overflow,
587produced only by truly enormous REs or by multiply nested bounded repetitions,
588are probably not handled well.
589.Pp
590Due to a mistake in
591.St -p1003.2-92 ,
592things like `a)b' are legal REs because `)' is a special character
593only in the presence of a previous unmatched `('.
594This can't be fixed until the spec is fixed.
595.Pp
596The standard's definition of back references is vague.
597For example, does
598`a\e(\e(b\e)*\e2\e)*d' match `abbbd'?
599Until the standard is clarified, behavior in such cases should not be
600relied on.
601.Pp
602The implementation of word-boundary matching is a bit of a kludge,
603and bugs may lurk in combinations of word-boundary matching and anchoring.
604