xref: /openbsd/lib/libc/regex/regex.3 (revision 274d7c50)
1.\"	$OpenBSD: regex.3,v 1.27 2016/05/26 05:46:44 martijn Exp $
2.\"
3.\" Copyright (c) 1997, Phillip F Knaack. All rights reserved.
4.\"
5.\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
6.\" Copyright (c) 1992, 1993, 1994
7.\"	The Regents of the University of California.  All rights reserved.
8.\"
9.\" This code is derived from software contributed to Berkeley by
10.\" Henry Spencer.
11.\"
12.\" Redistribution and use in source and binary forms, with or without
13.\" modification, are permitted provided that the following conditions
14.\" are met:
15.\" 1. Redistributions of source code must retain the above copyright
16.\"    notice, this list of conditions and the following disclaimer.
17.\" 2. Redistributions in binary form must reproduce the above copyright
18.\"    notice, this list of conditions and the following disclaimer in the
19.\"    documentation and/or other materials provided with the distribution.
20.\" 3. Neither the name of the University nor the names of its contributors
21.\"    may be used to endorse or promote products derived from this software
22.\"    without specific prior written permission.
23.\"
24.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
25.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
26.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
27.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
28.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
29.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
30.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
31.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
32.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
33.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
34.\" SUCH DAMAGE.
35.\"
36.\"	@(#)regex.3	8.4 (Berkeley) 3/20/94
37.\"
38.Dd $Mdocdate: May 26 2016 $
39.Dt REGEXEC 3
40.Os
41.Sh NAME
42.Nm regcomp ,
43.Nm regexec ,
44.Nm regerror ,
45.Nm regfree
46.Nd regular expression routines
47.Sh SYNOPSIS
48.In sys/types.h
49.In regex.h
50.Ft int
51.Fn regcomp "regex_t *preg" "const char *pattern" "int cflags"
52.Pp
53.Ft int
54.Fn regexec "const regex_t *preg" "const char *string" "size_t nmatch" \
55            "regmatch_t pmatch[]" "int eflags"
56.Pp
57.Ft size_t
58.Fn regerror "int errcode" "const regex_t *preg" "char *errbuf" \
59             "size_t errbuf_size"
60.Pp
61.Ft void
62.Fn regfree "regex_t *preg"
63.Sh DESCRIPTION
64These routines implement
65.St -p1003.2
66regular expressions
67.Pq Dq REs ;
68see
69.Xr re_format 7 .
70.Fn regcomp
71compiles an RE written as a string into an internal form,
72.Fn regexec
73matches that internal form against a string and reports results,
74.Fn regerror
75transforms error codes from either into human-readable messages, and
76.Fn regfree
77frees any dynamically allocated storage used by the internal form
78of an RE.
79.Pp
80The header
81.In regex.h
82declares two structure types,
83.Li regex_t
84and
85.Li regmatch_t ,
86the former for compiled internal forms and the latter for match reporting.
87It also declares the four functions,
88a type
89.Li regoff_t ,
90and a number of constants with names starting with
91.Dv REG_ .
92.Pp
93.Fn regcomp
94compiles the regular expression contained in the
95.Fa pattern
96string,
97subject to the flags in
98.Fa cflags ,
99and places the results in the
100.Li regex_t
101structure pointed to by
102.Fa preg .
103The
104.Fa cflags
105argument is the bitwise
106.Tn OR
107of zero or more of the following values:
108.Bl -tag -width XREG_EXTENDEDX
109.It Dv REG_EXTENDED
110Compile modern
111.Pq Dq extended
112REs,
113rather than the obsolete
114.Pq Dq basic
115REs that are the default.
116.It Dv REG_BASIC
117This is a synonym for 0,
118provided as a counterpart to
119.Dv REG_EXTENDED
120to improve readability.
121.It Dv REG_NOSPEC
122Compile with recognition of all special characters turned off.
123All characters are thus considered ordinary,
124so the RE is a literal string.
125This is an extension,
126compatible with but not specified by
127.St -p1003.2 ,
128and should be used with
129caution in software intended to be portable to other systems.
130.Dv REG_EXTENDED
131and
132.Dv REG_NOSPEC
133may not be used in the same call to
134.Fn regcomp .
135.It Dv REG_ICASE
136Compile for matching that ignores upper/lower case distinctions.
137See
138.Xr re_format 7 .
139.It Dv REG_NOSUB
140Compile for matching that need only report success or failure,
141not what was matched.
142.It Dv REG_NEWLINE
143Compile for newline-sensitive matching.
144By default, newline is a completely ordinary character with no special
145meaning in either REs or strings.
146With this flag,
147.Ql \&[^
148bracket expressions and
149.Ql \&.
150never match newline,
151a
152.Ql ^
153anchor matches the null string after any newline in the string
154in addition to its normal function,
155and the
156.Ql $
157anchor matches the null string before any newline in the
158string in addition to its normal function.
159.It Dv REG_PEND
160The regular expression ends,
161not at the first NUL,
162but just before the character pointed to by the
163.Fa re_endp
164member of the structure pointed to by
165.Fa preg .
166The
167.Fa re_endp
168member is of type
169.Fa const\ char\ * .
170This flag permits inclusion of NULs in the RE;
171they are considered ordinary characters.
172This is an extension,
173compatible with but not specified by
174.St -p1003.2 ,
175and should be used with
176caution in software intended to be portable to other systems.
177.El
178.Pp
179When successful,
180.Fn regcomp
181returns 0 and fills in the structure pointed to by
182.Fa preg .
183One member of that structure
184(other than
185.Fa re_endp )
186is publicized:
187.Fa re_nsub ,
188of type
189.Fa size_t ,
190contains the number of parenthesized subexpressions within the RE
191(except that the value of this member is undefined if the
192.Dv REG_NOSUB
193flag was used).
194If
195.Fn regcomp
196fails, it returns a non-zero error code;
197see DIAGNOSTICS.
198.Pp
199.Fn regexec
200matches the compiled RE pointed to by
201.Fa preg
202against the
203.Fa string ,
204subject to the flags in
205.Fa eflags ,
206and reports results using
207.Fa nmatch ,
208.Fa pmatch ,
209and the returned value.
210The RE must have been compiled by a previous invocation of
211.Fn regcomp .
212The compiled form is not altered during execution of
213.Fn regexec ,
214so a single compiled RE can be used simultaneously by multiple threads.
215.Pp
216By default,
217the NUL-terminated string pointed to by
218.Fa string
219is considered to be the text of an entire line, minus any terminating
220newline.
221The
222.Fa eflags
223argument is the bitwise
224.Tn OR
225of zero or more of the following values:
226.Bl -tag -width XREG_STARTENDX
227.It Dv REG_NOTBOL
228The first character of the string is treated as the continuation
229of a line.
230This means that the anchors
231.Ql ^ ,
232.Ql [[:<:]] ,
233and
234.Ql \e<
235do not match before it; but see
236.Dv REG_STARTEND
237below.
238This does not affect the behavior of newlines under
239.Dv REG_NEWLINE .
240.It Dv REG_NOTEOL
241The NUL terminating
242the string
243does not end a line, so the
244.Ql $
245anchor does not match before it.
246This does not affect the behavior of newlines under
247.Dv REG_NEWLINE .
248.It Dv REG_STARTEND
249The string is considered to start at
250.Fa string No +
251.Fa pmatch Ns [0]. Ns Fa rm_so
252and to end before the byte located at
253.Fa string No +
254.Fa pmatch Ns [0]. Ns Fa rm_eo ,
255regardless of the value of
256.Fa nmatch .
257See below for the definition of
258.Fa pmatch
259and
260.Fa nmatch .
261This is an extension,
262compatible with but not specified by
263.St -p1003.2 ,
264and should be used with
265caution in software intended to be portable to other systems.
266.Pp
267Without
268.Dv REG_NOTBOL ,
269the position
270.Fa rm_so
271is considered the beginning of a line, such that
272.Ql ^
273matches before it, and the beginning of a word if there is a word
274character at this position, such that
275.Ql [[:<:]]
276and
277.Ql \e<
278match before it.
279.Pp
280With
281.Dv REG_NOTBOL ,
282the character at position
283.Fa rm_so
284is treated as the continuation of a line, and if
285.Fa rm_so
286is greater than 0, the preceding character is taken into consideration.
287If the preceding character is a newline and the regular expression was compiled
288with
289.Dv REG_NEWLINE ,
290.Ql ^
291matches before the string; if the preceding character is not a word character
292but the string starts with a word character,
293.Ql [[:<:]]
294and
295.Ql \e<
296match before the string.
297.El
298.Pp
299See
300.Xr re_format 7
301for a discussion of what is matched in situations where an RE or a
302portion thereof could match any of several substrings of
303.Fa string .
304.Pp
305Normally,
306.Fn regexec
307returns 0 for success and the non-zero code
308.Dv REG_NOMATCH
309for failure.
310Other non-zero error codes may be returned in exceptional situations;
311see DIAGNOSTICS.
312.Pp
313If
314.Dv REG_NOSUB
315was specified in the compilation of the RE,
316or if
317.Fa nmatch
318is 0,
319.Fn regexec
320ignores the
321.Fa pmatch
322argument (but see below for the case where
323.Dv REG_STARTEND
324is specified).
325Otherwise,
326.Fa pmatch
327points to an array of
328.Fa nmatch
329structures of type
330.Li regmatch_t .
331Such a structure has at least the members
332.Fa rm_so
333and
334.Fa rm_eo ,
335both of type
336.Fa regoff_t
337(a signed arithmetic type at least as large as an
338.Li off_t
339and a
340.Li ssize_t ) ,
341containing respectively the offset of the first character of a substring
342and the offset of the first character after the end of the substring.
343Offsets are measured from the beginning of the
344.Fa string
345argument given to
346.Fn regexec .
347An empty substring is denoted by equal offsets,
348both indicating the character following the empty substring.
349.Pp
350The 0th member of the
351.Fa pmatch
352array is filled in to indicate what substring of
353.Fa string
354was matched by the entire RE.
355Remaining members report what substring was matched by parenthesized
356subexpressions within the RE;
357member
358.Va i
359reports subexpression
360.Va i ,
361with subexpressions counted (starting at 1) by the order of their opening
362parentheses in the RE, left to right.
363Unused entries in the array\(emcorresponding either to subexpressions that
364did not participate in the match at all, or to subexpressions that do not
365exist in the RE (that is, \fIi\fR\ > \fIpreg\fR\->\fIre_nsub\fR)\(emhave both
366.Fa rm_so
367and
368.Fa rm_eo
369set to \-1.
370If a subexpression participated in the match several times,
371the reported substring is the last one it matched.
372(Note, as an example in particular, that when the RE
373.Dq (b*)+
374matches
375.Dq bbb ,
376the parenthesized subexpression matches each of the three
377.Sq b Ns s
378and then
379an infinite number of empty strings following the last
380.Sq b ,
381so the reported substring is one of the empties.)
382.Pp
383If
384.Dv REG_STARTEND
385is specified,
386.Fa pmatch
387must point to at least one
388.Li regmatch_t
389(even if
390.Fa nmatch
391is 0 or
392.Dv REG_NOSUB
393was specified),
394to hold the input offsets for
395.Dv REG_STARTEND .
396Use for output is still entirely controlled by
397.Fa nmatch ;
398if
399.Fa nmatch
400is 0 or
401.Dv REG_NOSUB
402was specified,
403the value of
404.Fa pmatch[0]
405will not be changed by a successful
406.Fn regexec .
407.Pp
408.Fn regerror
409maps a non-zero
410.Va errcode
411from either
412.Fn regcomp
413or
414.Fn regexec
415to a human-readable, printable message.
416If
417.Fa preg
418is non-NULL,
419the error code should have arisen from use of
420the
421.Li regex_t
422pointed to by
423.Fa preg ,
424and if the error code came from
425.Fn regcomp ,
426it should have been the result from the most recent
427.Fn regcomp
428using that
429.Li regex_t .
430.Pf ( Fn regerror
431may be able to supply a more detailed message using information
432from the
433.Li regex_t . )
434.Fn regerror
435places the NUL-terminated message into the buffer pointed to by
436.Fa errbuf ,
437limiting the length (including the NUL) to at most
438.Fa errbuf_size
439bytes.
440If the whole message won't fit,
441as much of it as will fit before the terminating NUL is supplied.
442In any case,
443the returned value is the size of buffer needed to hold the whole
444message (including the terminating NUL).
445If
446.Fa errbuf_size
447is 0,
448.Fa errbuf
449is ignored but the return value is still correct.
450.Pp
451If the
452.Fa errcode
453given to
454.Fn regerror
455is first
456.Tn OR Ns 'ed
457with
458.Dv REG_ITOA ,
459the
460.Dq message
461that results is the printable name of the error code,
462e.g.,
463.Dq REG_NOMATCH ,
464rather than an explanation thereof.
465If
466.Fa errcode
467is
468.Dv REG_ATOI ,
469then
470.Fa preg
471shall be non-null and the
472.Fa re_endp
473member of the structure it points to
474must point to the printable name of an error code;
475in this case, the result in
476.Fa errbuf
477is the decimal digits of
478the numeric value of the error code
479(0 if the name is not recognized).
480.Dv REG_ITOA
481and
482.Dv REG_ATOI
483are intended primarily as debugging facilities;
484they are extensions,
485compatible with but not specified by
486.St -p1003.2
487and should be used with
488caution in software intended to be portable to other systems.
489Be warned also that they are considered experimental and changes are possible.
490.Pp
491.Fn regfree
492frees any dynamically allocated storage associated with the compiled RE
493pointed to by
494.Fa preg .
495The remaining
496.Li regex_t
497is no longer a valid compiled RE
498and the effect of supplying it to
499.Fn regexec
500or
501.Fn regerror
502is undefined.
503.Pp
504None of these functions references global variables except for tables
505of constants;
506all are safe for use from multiple threads if the arguments are safe.
507.Sh IMPLEMENTATION CHOICES
508There are a number of decisions that
509.St -p1003.2
510leaves up to the implementor,
511either by explicitly saying
512.Dq undefined
513or by virtue of them being
514forbidden by the RE grammar.
515This implementation treats them as follows.
516.Pp
517See
518.Xr re_format 7
519for a discussion of the definition of case-independent matching.
520.Pp
521There is no particular limit on the length of REs,
522except insofar as memory is limited.
523Memory usage is approximately linear in RE size, and largely insensitive
524to RE complexity, except for bounded repetitions.
525See
526.Sx BUGS
527for one short RE using them
528that will run almost any system out of memory.
529.Pp
530A backslashed character other than one specifically given a magic meaning
531by
532.St -p1003.2
533(such magic meanings occur only in obsolete REs)
534is taken as an ordinary character.
535.Pp
536Any unmatched
537.Ql \&[
538is a
539.Dv REG_EBRACK
540error.
541.Pp
542Equivalence classes cannot begin or end bracket-expression ranges.
543The endpoint of one range cannot begin another.
544.Pp
545RE_DUP_MAX, the limit on repetition counts in bounded repetitions, is 255.
546.Pp
547A repetition operator (?, *, +, or bounds) cannot follow another
548repetition operator.
549A repetition operator cannot begin an expression or subexpression
550or follow
551.Ql ^
552or
553.Ql | .
554.Pp
555A
556.Ql |
557cannot appear first or last in a (sub)expression, or after another
558.Ql | ,
559i.e., an operand of
560.Ql |
561cannot be an empty subexpression.
562An empty parenthesized subexpression,
563.Ql \&(\&) ,
564is legal and matches an
565empty (sub)string.
566An empty string is not a legal RE.
567.Pp
568A
569.Ql {
570followed by a digit is considered the beginning of bounds for a
571bounded repetition, which must then follow the syntax for bounds.
572A
573.Ql {
574.Em not
575followed by a digit is considered an ordinary character.
576.Pp
577.Ql ^
578and
579.Ql $
580beginning and ending subexpressions in obsolete
581.Pq Dq basic
582REs are anchors, not ordinary characters.
583.Sh DIAGNOSTICS
584Non-zero error codes from
585.Fn regcomp
586and
587.Fn regexec
588include the following:
589.Pp
590.Bl -tag -compact -width XREG_ECOLLATEX
591.It Er REG_NOMATCH
592regexec() failed to match
593.It Er REG_BADPAT
594invalid regular expression
595.It Er REG_ECOLLATE
596invalid collating element
597.It Er REG_ECTYPE
598invalid character class
599.It Er REG_EESCAPE
600\e applied to unescapable character
601.It Er REG_ESUBREG
602invalid backreference number
603.It Er REG_EBRACK
604brackets [ ] not balanced
605.It Er REG_EPAREN
606parentheses ( ) not balanced
607.It Er REG_EBRACE
608braces { } not balanced
609.It Er REG_BADBR
610invalid repetition count(s) in { }
611.It Er REG_ERANGE
612invalid character range in [ ]
613.It Er REG_ESPACE
614ran out of memory
615.It Er REG_BADRPT
616?, *, or + operand invalid
617.It Er REG_EMPTY
618empty (sub)expression
619.It Er REG_ASSERT
620.Dq can't happen
621\(emyou found a bug
622.It Er REG_INVARG
623invalid argument, e.g., negative-length string
624.El
625.Sh SEE ALSO
626.Xr grep 1 ,
627.Xr re_format 7
628.Pp
629.St -p1003.2 ,
630sections 2.8 (Regular Expression Notation)
631and
632B.5 (C Binding for Regular Expression Matching).
633.Sh HISTORY
634Predecessors to
635.Fn regcomp
636and
637.Fn regexec ,
638.Fn regcmp
639and
640.Fn regex ,
641first appeared in the Programmer's Workbench (PWB/UNIX).
642.Pp
643The present functions were originally written by Henry Spencer
644and altered for inclusion in the
645.Bx 4.4
646distribution.
647.Sh BUGS
648This is an alpha release with known defects.
649Please report problems.
650.Pp
651There is one known functionality bug.
652The implementation of internationalization is incomplete:
653the locale is always assumed to be the default one of
654.St -p1003.2 ,
655and only the collating elements etc. of that locale are available.
656.Pp
657The back-reference code is subtle and doubts linger about its correctness
658in complex cases.
659.Pp
660.Fn regexec
661performance is poor.
662This will improve with later releases.
663.Fa nmatch
664exceeding 0 is expensive;
665.Fa nmatch
666exceeding 1 is worse.
667.Fn regexec
668is largely insensitive to RE complexity
669.Em except
670that back references are massively expensive.
671RE length does matter; in particular, there is a strong speed bonus
672for keeping RE length under about 30 characters,
673with most special characters counting roughly double.
674.Pp
675.Fn regcomp
676implements bounded repetitions by macro expansion,
677which is costly in time and space if counts are large
678or bounded repetitions are nested.
679A RE like, say,
680.Dq ((((a{1,100}){1,100}){1,100}){1,100}){1,100}
681will (eventually) run almost any existing machine out of swap space.
682.Pp
683There are suspected problems with response to obscure error conditions.
684Notably,
685certain kinds of internal overflow,
686produced only by truly enormous REs or by multiply nested bounded repetitions,
687are probably not handled well.
688.Pp
689Due to a mistake in
690.St -p1003.2 ,
691things like
692.Ql a)b
693are legal REs because
694.Ql \&)
695is
696a special character only in the presence of a previous unmatched
697.Ql \&( .
698This can't be fixed until the spec is fixed.
699.Pp
700The standard's definition of back references is vague.
701For example, does
702.Dq a\e(\e(b\e)*\e2\e)*d
703match
704.Dq abbbd ?
705Until the standard is clarified,
706behavior in such cases should not be relied on.
707.Pp
708The implementation of word-boundary matching is a bit of a kludge,
709and bugs may lurk in combinations of word-boundary matching and anchoring.
710