xref: /netbsd/lib/libc/regex/regex.3 (revision 6550d01e)
1.\"	$NetBSD: regex.3,v 1.21 2010/03/22 19:30:54 joerg Exp $
2.\"
3.\" Copyright (c) 1992, 1993, 1994
4.\"	The Regents of the University of California.  All rights reserved.
5.\"
6.\" This code is derived from software contributed to Berkeley by
7.\" Henry Spencer.
8.\"
9.\" Redistribution and use in source and binary forms, with or without
10.\" modification, are permitted provided that the following conditions
11.\" are met:
12.\" 1. Redistributions of source code must retain the above copyright
13.\"    notice, this list of conditions and the following disclaimer.
14.\" 2. Redistributions in binary form must reproduce the above copyright
15.\"    notice, this list of conditions and the following disclaimer in the
16.\"    documentation and/or other materials provided with the distribution.
17.\" 3. Neither the name of the University nor the names of its contributors
18.\"    may be used to endorse or promote products derived from this software
19.\"    without specific prior written permission.
20.\"
21.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
22.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
23.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
24.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
25.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
26.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
27.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
28.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
29.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
30.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
31.\" SUCH DAMAGE.
32.\"
33.\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
34.\"
35.\" This code is derived from software contributed to Berkeley by
36.\" Henry Spencer.
37.\"
38.\" Redistribution and use in source and binary forms, with or without
39.\" modification, are permitted provided that the following conditions
40.\" are met:
41.\" 1. Redistributions of source code must retain the above copyright
42.\"    notice, this list of conditions and the following disclaimer.
43.\" 2. Redistributions in binary form must reproduce the above copyright
44.\"    notice, this list of conditions and the following disclaimer in the
45.\"    documentation and/or other materials provided with the distribution.
46.\" 3. All advertising materials mentioning features or use of this software
47.\"    must display the following acknowledgement:
48.\"	This product includes software developed by the University of
49.\"	California, Berkeley and its contributors.
50.\" 4. Neither the name of the University nor the names of its contributors
51.\"    may be used to endorse or promote products derived from this software
52.\"    without specific prior written permission.
53.\"
54.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
55.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
56.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
57.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
58.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
59.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
60.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
61.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
62.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
63.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
64.\" SUCH DAMAGE.
65.\"
66.\"	@(#)regex.3	8.4 (Berkeley) 3/20/94
67.\"
68.Dd December 29, 2003
69.Dt REGEX 3
70.Os
71.Sh NAME
72.Nm regex ,
73.Nm regcomp ,
74.Nm regexec ,
75.Nm regerror ,
76.Nm regfree
77.Nd regular-expression library
78.Sh LIBRARY
79.Lb libc
80.Sh SYNOPSIS
81.In regex.h
82.Ft int
83.Fn regcomp "regex_t * restrict preg" "const char * restrict pattern" "int cflags"
84.Ft int
85.Fn regexec "const regex_t * restrict preg" "const char * restrict string" "size_t nmatch" "regmatch_t pmatch[]" "int eflags"
86.Ft size_t
87.Fn regerror "int errcode" "const regex_t * restrict preg" "char * restrict errbuf" "size_t errbuf_size"
88.Ft void
89.Fn regfree "regex_t *preg"
90.Sh DESCRIPTION
91These routines implement
92.St -p1003.2-92
93regular expressions (``RE''s);
94see
95.Xr re_format 7 .
96.Fn regcomp
97compiles an RE written as a string into an internal form,
98.Fn regexec
99matches that internal form against a string and reports results,
100.Fn regerror
101transforms error codes from either into human-readable messages,
102and
103.Fn regfree
104frees any dynamically-allocated storage used by the internal form
105of an RE.
106.Pp
107The header
108.In regex.h
109declares two structure types,
110.Fa regex_t
111and
112.Fa regmatch_t ,
113the former for compiled internal forms and the latter for match reporting.
114It also declares the four functions,
115a type
116.Fa regoff_t ,
117and a number of constants with names starting with ``REG_''.
118.Pp
119.Fn regcomp
120compiles the regular expression contained in the
121.Fa pattern
122string,
123subject to the flags in
124.Fa cflags ,
125and places the results in the
126.Fa regex_t
127structure pointed to by
128.Fa preg .
129.Fa cflags
130is the bitwise OR of zero or more of the following flags:
131.Bl -tag -width XXXREG_EXTENDED
132.It Dv REG_EXTENDED
133Compile modern (``extended'') REs, rather than the obsolete
134(``basic'') REs that are the default.
135.It Dv REG_BASIC
136This is a synonym for 0,
137provided as a counterpart to REG_EXTENDED to improve readability.
138.It Dv REG_NOSPEC
139Compile with recognition of all special characters turned off.
140All characters are thus considered ordinary, so the ``RE'' is a literal
141string.
142This is an extension, compatible with but not specified by
143.St -p1003.2-92 ,
144and should be used with caution in software intended to be portable to
145other systems.
146.Dv REG_EXTENDED
147and
148.Dv REG_NOSPEC
149may not be used in the same call to
150.Fn regcomp .
151.It Dv REG_ICASE
152Compile for matching that ignores upper/lower case distinctions.
153See
154.Xr re_format 7 .
155.It Dv REG_NOSUB
156Compile for matching that need only report success or failure, not
157what was matched.
158.It Dv REG_NEWLINE
159Compile for newline-sensitive matching.
160By default, newline is a completely ordinary character with no special
161meaning in either REs or strings.
162With this flag,
163`[^' bracket expressions and `.' never match newline,
164a `^' anchor matches the null string after any newline in the string
165in addition to its normal function,
166and the `$' anchor matches the null string before any newline in the
167string in addition to its normal function.
168.It Dv REG_PEND
169The regular expression ends, not at the first NUL, but just before the
170character pointed to by the
171.Fa re_endp
172member of the structure pointed to by
173.Fa preg .
174The
175.Fa re_endp
176member is of type
177.Fa "const\ char\ *" .
178This flag permits inclusion of NULs in the RE; they are considered
179ordinary characters.
180This is an extension, compatible with but not specified by
181.St -p1003.2-92 ,
182and should be used with caution in software intended to be portable to
183other systems.
184.El
185.Pp
186When successful,
187.Fn regcomp
188returns 0 and fills in the structure pointed to by
189.Fa preg .
190One member of that structure (other than
191.Fa re_endp )
192is publicized:
193.Fa re_nsub ,
194of type
195.Fa size_t ,
196contains the number of parenthesized subexpressions within the RE
197(except that the value of this member is undefined if the
198.Dv REG_NOSUB
199flag was used).
200If
201.Fn regcomp
202fails, it returns a non-zero error code;
203see
204.Sx DIAGNOSTICS .
205.Pp
206.Fn regexec
207matches the compiled RE pointed to by
208.Fa preg
209against the
210.Fa string ,
211subject to the flags in
212.Fa eflags ,
213and reports results using
214.Fa nmatch ,
215.Fa pmatch ,
216and the returned value.
217The RE must have been compiled by a previous invocation of
218.Fn regcomp .
219The compiled form is not altered during execution of
220.Fn regexec ,
221so a single compiled RE can be used simultaneously by multiple threads.
222.Pp
223By default,
224the NUL-terminated string pointed to by
225.Fa string
226is considered to be the text of an entire line, minus any terminating
227newline.
228The
229.Fa eflags
230argument is the bitwise OR of zero or more of the following flags:
231.Bl -tag -width XXXREG_NOTBOL
232.It Dv REG_NOTBOL
233The first character of the string
234is not the beginning of a line, so the `^' anchor should not match before it.
235This does not affect the behavior of newlines under
236.Dv REG_NEWLINE .
237.It Dv REG_NOTEOL
238The NUL terminating the string does not end a line, so the `$' anchor
239should not match before it.
240This does not affect the behavior of newlines under
241.Dv REG_NEWLINE .
242.It Dv REG_STARTEND
243The string is considered to start at
244.Fa string
245+
246.Fa pmatch[0].rm_so
247and to have a terminating NUL located at
248.Fa string
249+
250.Fa pmatch[0].rm_eo
251(there need not actually be a NUL at that location),
252regardless of the value of
253.Fa nmatch .
254See below for the definition of
255.Fa pmatch
256and
257.Fa nmatch .
258This is an extension, compatible with but not specified by
259.St -p1003.2-92 ,
260and should be used with caution in software intended to be portable to
261other systems.
262Note that a non-zero
263.Fa rm_so
264does not imply
265.Dv REG_NOTBOL ;
266.Dv REG_STARTEND
267affects only the location of the string, not how it is matched.
268.El
269.Pp
270See
271.Xr re_format 7
272for a discussion of what is matched in situations where an RE or a
273portion thereof could match any of several substrings of
274.Fa string .
275.Pp
276Normally,
277.Fn regexec
278returns 0 for success and the non-zero code
279.Dv REG_NOMATCH
280for failure.
281Other non-zero error codes may be returned in exceptional situations;
282see
283.Sx DIAGNOSTICS .
284.Pp
285If
286.Dv REG_NOSUB
287was specified in the compilation of the RE, or if
288.Fa nmatch
289is 0,
290.Fn regexec
291ignores the
292.Fa pmatch
293argument (but see below for the case where
294.Dv REG_STARTEND
295is specified).
296Otherwise,
297.Fa pmatch
298points to an array of
299.Fa nmatch
300structures of type
301.Fa regmatch_t .
302Such a structure has at least the members
303.Fa rm_so
304and
305.Fa rm_eo ,
306both of type
307.Fa regoff_t
308(a signed arithmetic type at least as large as an
309.Fa off_t
310and a
311.Fa ssize_t ) ,
312containing respectively the offset of the first character of a substring
313and the offset of the first character after the end of the substring.
314Offsets are measured from the beginning of the
315.Fa string
316argument given to
317.Fn regexec .
318An empty substring is denoted by equal offsets,
319both indicating the character following the empty substring.
320.Pp
321The 0th member of the
322.Fa pmatch
323array is filled in to indicate what substring of
324.Fa string
325was matched by the entire RE.
326Remaining members report what substring was matched by parenthesized
327subexpressions within the RE;
328member
329.Fa i
330reports subexpression
331.Fa i ,
332with subexpressions counted (starting at 1) by the order of their
333opening parentheses in the RE, left to right.
334Unused entries in the array\(emcorresponding either to subexpressions that
335did not participate in the match at all, or to subexpressions that do not
336exist in the RE (that is,
337.Fa i
338\*[Gt]
339.Fa preg-\*[Gt]re_nsub )
340\(emhave both
341.Fa rm_so
342and
343.Fa rm_eo
344set to -1.
345If a subexpression participated in the match several times,
346the reported substring is the last one it matched.
347(Note, as an example in particular, that when the RE `(b*)+' matches `bbb',
348the parenthesized subexpression matches each of the three `b's and then
349an infinite number of empty strings following the last `b',
350so the reported substring is one of the empties.)
351.Pp
352If
353.Dv REG_STARTEND
354is specified,
355.Fa pmatch
356must point to at least one
357.Fa regmatch_t
358(even if
359.Fa nmatch
360is 0 or
361.Dv REG_NOSUB
362was specified),
363to hold the input offsets for
364.Dv REG_STARTEND .
365Use for output is still entirely controlled by
366.Fa nmatch ;
367if
368.Fa nmatch
369is 0 or
370.Dv REG_NOSUB
371was specified,
372the value of
373.Fa pmatch [0]
374will not be changed by a successful
375.Fn regexec .
376.Pp
377.Fn regerror
378maps a non-zero
379.Fa errcode
380from either
381.Fn regcomp
382or
383.Fn regexec
384to a human-readable, printable message.
385If
386.Fa preg
387is non-NULL,
388the error code should have arisen from use of the
389.Fa regex_t
390pointed to by
391.Fa preg ,
392and if the error code came from
393.Fn regcomp ,
394it should have been the result from the most recent
395.Fn regcomp
396using that
397.Fa regex_t . (
398.Fn regerror
399may be able to supply a more detailed message using information
400from the
401.Fa regex_t . )
402.Fn regerror
403places the NUL-terminated message into the buffer pointed to by
404.Fa errbuf ,
405limiting the length (including the NUL) to at most
406.Fa errbuf_size
407bytes.
408If the whole message won't fit,
409as much of it as will fit before the terminating NUL is supplied.
410In any case,
411the returned value is the size of buffer needed to hold the whole
412message (including terminating NUL).
413If
414.Fa errbuf_size
415is 0,
416.Fa errbuf
417is ignored but the return value is still correct.
418.Pp
419If the
420.Fa errcode
421given to
422.Fn regerror
423is first ORed with
424.Dv REG_ITOA ,
425the ``message'' that results is the printable name of the error code,
426e.g. ``REG_NOMATCH'',
427rather than an explanation thereof.
428If
429.Fa errcode
430is
431.Dv REG_ATOI ,
432then
433.Fa preg
434shall be non-NULL and the
435.Fa re_endp
436member of the structure it points to
437must point to the printable name of an error code;
438in this case, the result in
439.Fa errbuf
440is the decimal digits of
441the numeric value of the error code
442(0 if the name is not recognized).
443.Dv REG_ITOA
444and
445.Dv REG_ATOI
446are intended primarily as debugging facilities;
447they are extensions, compatible with but not specified by
448.St -p1003.2-92 ,
449and should be used with caution in software intended to be portable to
450other systems.
451Be warned also that they are considered experimental and changes are possible.
452.Pp
453.Fn regfree
454frees any dynamically-allocated storage associated with the compiled RE
455pointed to by
456.Fa preg .
457The remaining
458.Fa regex_t
459is no longer a valid compiled RE
460and the effect of supplying it to
461.Fn regexec
462or
463.Fn regerror
464is undefined.
465.Pp
466None of these functions references global variables except for tables
467of constants;
468all are safe for use from multiple threads if the arguments are safe.
469.Sh IMPLEMENTATION CHOICES
470There are a number of decisions that
471.St -p1003.2-92
472leaves up to the implementor,
473either by explicitly saying ``undefined'' or by virtue of them being
474forbidden by the RE grammar.
475This implementation treats them as follows.
476.Pp
477See
478.Xr re_format 7
479for a discussion of the definition of case-independent matching.
480.Pp
481There is no particular limit on the length of REs,
482except insofar as memory is limited.
483Memory usage is approximately linear in RE size, and largely insensitive
484to RE complexity, except for bounded repetitions.
485See BUGS for one short RE using them
486that will run almost any system out of memory.
487.Pp
488A backslashed character other than one specifically given a magic meaning
489by
490.St -p1003.2-92
491(such magic meanings occur only in obsolete [``basic''] REs)
492is taken as an ordinary character.
493.Pp
494Any unmatched [ is a
495.Dv REG_EBRACK
496error.
497.Pp
498Equivalence classes cannot begin or end bracket-expression ranges.
499The endpoint of one range cannot begin another.
500.Pp
501.Dv RE_DUP_MAX ,
502the limit on repetition counts in bounded repetitions, is 255.
503.Pp
504A repetition operator (?, *, +, or bounds) cannot follow another
505repetition operator.
506A repetition operator cannot begin an expression or subexpression
507or follow `^' or `|'.
508.Pp
509`|' cannot appear first or last in a (sub)expression or after another `|',
510i.e. an operand of `|' cannot be an empty subexpression.
511An empty parenthesized subexpression, `()', is legal and matches an
512empty (sub)string.
513An empty string is not a legal RE.
514.Pp
515A `{' followed by a digit is considered the beginning of bounds for a
516bounded repetition, which must then follow the syntax for bounds.
517A `{'
518.Em not
519followed by a digit is considered an ordinary character.
520.Pp
521`^' and `$' beginning and ending subexpressions in obsolete (``basic'')
522REs are anchors, not ordinary characters.
523.Sh DIAGNOSTICS
524Non-zero error codes from
525.Fn regcomp
526and
527.Fn regexec
528include the following:
529.Pp
530.Bl -tag -width XXXREG_ECOLLATE -compact
531.It Dv REG_NOMATCH
532.Fn regexec
533failed to match
534.It Dv REG_BADPAT
535invalid regular expression
536.It Dv REG_ECOLLATE
537invalid collating element
538.It Dv REG_ECTYPE
539invalid character class
540.It Dv REG_EESCAPE
541\e applied to unescapable character
542.It Dv REG_ESUBREG
543invalid backreference number
544.It Dv REG_EBRACK
545brackets [ ] not balanced
546.It Dv REG_EPAREN
547parentheses ( ) not balanced
548.It Dv REG_EBRACE
549braces { } not balanced
550.It Dv REG_BADBR
551invalid repetition count(s) in { }
552.It Dv REG_ERANGE
553invalid character range in [ ]
554.It Dv REG_ESPACE
555ran out of memory
556.It Dv REG_BADRPT
557?, *, or + operand invalid
558.It Dv REG_EMPTY
559empty (sub)expression
560.It Dv REG_ASSERT
561``can't happen''\(emyou found a bug
562.It Dv REG_INVARG
563invalid argument, e.g. negative-length string
564.El
565.Sh SEE ALSO
566.Xr grep 1 ,
567.Xr sed 1 ,
568.Xr re_format 7
569.Pp
570.St -p1003.2-92 ,
571sections 2.8 (Regular Expression Notation)
572and
573B.5 (C Binding for Regular Expression Matching).
574.Sh HISTORY
575Originally written by Henry Spencer.
576Altered for inclusion in the
577.Bx 4.4
578distribution.
579.Sh BUGS
580There is one known functionality bug.
581The implementation of internationalization is incomplete:
582the locale is always assumed to be the default one of
583.St -p1003.2-92 ,
584and only the collating elements etc. of that locale are available.
585.Pp
586The back-reference code is subtle and doubts linger about its correctness
587in complex cases.
588.Pp
589.Fn regexec
590performance is poor.
591This will improve with later releases.
592.Fa nmatch
593exceeding 0 is expensive;
594.Fa nmatch
595exceeding 1 is worse.
596.Fa regexec
597is largely insensitive to RE complexity
598.Em except
599that back references are massively expensive.
600RE length does matter; in particular, there is a strong speed bonus
601for keeping RE length under about 30 characters,
602with most special characters counting roughly double.
603.Pp
604.Fn regcomp
605implements bounded repetitions by macro expansion,
606which is costly in time and space if counts are large
607or bounded repetitions are nested.
608An RE like, say,
609`((((a{1,100}){1,100}){1,100}){1,100}){1,100}'
610will (eventually) run almost any existing machine out of swap space.
611.Pp
612There are suspected problems with response to obscure error conditions.
613Notably,
614certain kinds of internal overflow,
615produced only by truly enormous REs or by multiply nested bounded repetitions,
616are probably not handled well.
617.Pp
618Due to a mistake in
619.St -p1003.2-92 ,
620things like `a)b' are legal REs because `)' is a special character
621only in the presence of a previous unmatched `('.
622This can't be fixed until the spec is fixed.
623.Pp
624The standard's definition of back references is vague.
625For example, does
626`a\e(\e(b\e)*\e2\e)*d' match `abbbd'?
627Until the standard is clarified, behavior in such cases should not be
628relied on.
629.Pp
630The implementation of word-boundary matching is a bit of a kludge,
631and bugs may lurk in combinations of word-boundary matching and anchoring.
632