.\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
.\" Copyright (c) 1992, 1993, 1994
.\"	The Regents of the University of California.  All rights reserved.
.\"
.\" This code is derived from software contributed to Berkeley by
.\" Henry Spencer.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\"	@(#)regex.3	8.4 (Berkeley) 3/20/94
.\" $FreeBSD$
.\"
.Dd April 15, 2017
.Dt REGEX 3
.Os
.Sh NAME
.Nm regcomp ,
.Nm regexec ,
.Nm regerror ,
.Nm regfree
.Nd regular-expression library
.Sh LIBRARY
.Lb libc
.Sh SYNOPSIS
.In regex.h
.Ft int
.Fo regcomp
.Fa "regex_t * restrict preg" "const char * restrict pattern" "int cflags"
.Fc
.Ft int
.Fo regexec
.Fa "const regex_t * restrict preg" "const char * restrict string"
.Fa "size_t nmatch" "regmatch_t pmatch[restrict]" "int eflags"
.Fc
.Ft size_t
.Fo regerror
.Fa "int errcode" "const regex_t * restrict preg"
.Fa "char * restrict errbuf" "size_t errbuf_size"
.Fc
.Ft void
.Fn regfree "regex_t *preg"
.Sh DESCRIPTION
These routines implement
.St -p1003.2
regular expressions
.Pq Do RE Dc Ns s ;
see
.Xr re_format 7 .
The
.Fn regcomp
function
compiles an RE written as a string into an internal form,
.Fn regexec
matches that internal form against a string and reports results,
.Fn regerror
transforms error codes from either into human-readable messages,
and
.Fn regfree
frees any dynamically-allocated storage used by the internal form
of an RE.
.Pp
The header
.In regex.h
declares two structure types,
.Ft regex_t
and
.Ft regmatch_t ,
the former for compiled internal forms and the latter for match reporting.
It also declares the four functions,
a type
.Ft regoff_t ,
and a number of constants with names starting with
.Dq Dv REG_ .
.Pp
The
.Fn regcomp
function
compiles the regular expression contained in the
.Fa pattern
string,
subject to the flags in
.Fa cflags ,
and places the results in the
.Ft regex_t
structure pointed to by
.Fa preg .
The
.Fa cflags
argument
is the bitwise OR of zero or more of the following flags:
.Bl -tag -width REG_EXTENDED
.It Dv REG_EXTENDED
Compile modern
.Pq Dq extended
REs,
rather than the obsolete
.Pq Dq basic
REs that
are the default.
.It Dv REG_BASIC
This is a synonym for 0,
provided as a counterpart to
.Dv REG_EXTENDED
to improve readability.
.It Dv REG_NOSPEC
Compile with recognition of all special characters turned off.
All characters are thus considered ordinary,
so the
.Dq RE
is a literal string.
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
.Dv REG_EXTENDED
and
.Dv REG_NOSPEC
may not be used
in the same call to
.Fn regcomp .
.It Dv REG_ICASE
Compile for matching that ignores upper/lower case distinctions.
See
.Xr re_format 7 .
.It Dv REG_NOSUB
Compile for matching that need only report success or failure,
not what was matched.
.It Dv REG_NEWLINE
Compile for newline-sensitive matching.
By default, newline is a completely ordinary character with no special
meaning in either REs or strings.
With this flag,
.Ql [^
bracket expressions and
.Ql .\&
never match newline,
a
.Ql ^\&
anchor matches the null string after any newline in the string
in addition to its normal function,
and the
.Ql $\&
anchor matches the null string before any newline in the
string in addition to its normal function.
.It Dv REG_PEND
The regular expression ends,
not at the first NUL,
but just before the character pointed to by the
.Va re_endp
member of the structure pointed to by
.Fa preg .
The
.Va re_endp
member is of type
.Ft "const char *" .
This flag permits inclusion of NULs in the RE;
they are considered ordinary characters.
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
.It Dv REG_POSIX
Compile only
.St -p1003.2
compliant expressions.
This flag has no effect unless linking against
.Nm libregex .
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
.El
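.Pp
As an illustration of how the compilation flags change matching,
the following minimal sketch wraps the compile, match, and free steps.
The name
.Fn re_matches
is hypothetical and not part of the library;
the sketch uses only flags described above.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: compile `pattern` with `cflags`, test it against
 * `text`, and release the compiled form.
 * Returns 1 on a match, 0 on REG_NOMATCH, and -1 on any other error.
 */
int
re_matches(const char *pattern, const char *text, int cflags)
{
	regex_t re;
	int rc;

	assert(pattern != NULL && text != NULL);

	/* REG_NOSUB: only success or failure is wanted, not offsets. */
	if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
		return (-1);
	rc = regexec(&re, text, 0, NULL, 0);
	regfree(&re);

	if (rc == 0)
		return (1);
	return (rc == REG_NOMATCH ? 0 : -1);
}
```

.Pp
With this helper, the pattern
.Ql ^b$
is expected to match the middle line of a three-line string only when
.Dv REG_NEWLINE
is included in
.Fa cflags .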
.Pp
When successful,
.Fn regcomp
returns 0 and fills in the structure pointed to by
.Fa preg .
One member of that structure
(other than
.Va re_endp )
is publicized:
.Va re_nsub ,
of type
.Ft size_t ,
contains the number of parenthesized subexpressions within the RE
(except that the value of this member is undefined if the
.Dv REG_NOSUB
flag was used).
If
.Fn regcomp
fails, it returns a non-zero error code;
see
.Sx DIAGNOSTICS .
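.Pp
The return value and the
.Va re_nsub
member can be checked as in the following minimal sketch.
The helper name
.Fn count_groups
is hypothetical;
it compiles without
.Dv REG_NOSUB
so that
.Va re_nsub
is defined.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: return the number of parenthesized subexpressions
 * in the extended RE `pattern`, or -1 if regcomp() rejects it.
 * A pmatch array that reports every subexpression needs one more slot
 * than this count, because slot 0 describes the entire match.
 */
int
count_groups(const char *pattern)
{
	regex_t re;
	size_t nsub;

	assert(pattern != NULL);

	if (regcomp(&re, pattern, REG_EXTENDED) != 0)
		return (-1);
	nsub = re.re_nsub;	/* read before the compiled form is freed */
	regfree(&re);
	return ((int)nsub);
}
```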
.Pp
The
.Fn regexec
function
matches the compiled RE pointed to by
.Fa preg
against the
.Fa string ,
subject to the flags in
.Fa eflags ,
and reports results using
.Fa nmatch ,
.Fa pmatch ,
and the returned value.
The RE must have been compiled by a previous invocation of
.Fn regcomp .
The compiled form is not altered during execution of
.Fn regexec ,
so a single compiled RE can be used simultaneously by multiple threads.
.Pp
By default,
the NUL-terminated string pointed to by
.Fa string
is considered to be the text of an entire line, minus any terminating
newline.
The
.Fa eflags
argument is the bitwise OR of zero or more of the following flags:
.Bl -tag -width REG_STARTEND
.It Dv REG_NOTBOL
The first character of the string is treated as the continuation
of a line.
This means that the anchors
.Ql ^\& ,
.Ql [[:<:]] ,
and
.Ql \e<
do not match before it; but see
.Dv REG_STARTEND
below.
This does not affect the behavior of newlines under
.Dv REG_NEWLINE .
.It Dv REG_NOTEOL
The NUL terminating
the string
does not end a line, so the
.Ql $\&
anchor does not match before it.
This does not affect the behavior of newlines under
.Dv REG_NEWLINE .
.It Dv REG_STARTEND
The string is considered to start at
.Fa string No +
.Fa pmatch Ns [0]. Ns Fa rm_so
and to end before the byte located at
.Fa string No +
.Fa pmatch Ns [0]. Ns Fa rm_eo ,
regardless of the value of
.Fa nmatch .
See below for the definition of
.Fa pmatch
and
.Fa nmatch .
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
.Pp
Without
.Dv REG_NOTBOL ,
the position
.Fa rm_so
is considered the beginning of a line, such that
.Ql ^
matches before it, and the beginning of a word if there is a word
character at this position, such that
.Ql [[:<:]]
and
.Ql \e<
match before it.
.Pp
With
.Dv REG_NOTBOL ,
the character at position
.Fa rm_so
is treated as the continuation of a line, and if
.Fa rm_so
is greater than 0, the preceding character is taken into consideration.
If the preceding character is a newline and the regular expression was compiled
with
.Dv REG_NEWLINE ,
.Ql ^
matches before the string; if the preceding character is not a word character
but the string starts with a word character,
.Ql [[:<:]]
and
.Ql \e<
match before the string.
.El
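.Pp
The effect of
.Dv REG_NOTBOL
and
.Dv REG_NOTEOL
can be seen when a buffer holds only a fragment of a line.
The following is a minimal sketch;
the helper name
.Fn chunk_matches
is hypothetical, and it compiles the pattern as a basic RE on every call
purely to keep the example short.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: match the basic RE `pattern` against `chunk`
 * using the execution flags in `eflags`.
 * Returns 1 on a match, 0 on REG_NOMATCH, and -1 on any other error.
 */
int
chunk_matches(const char *pattern, const char *chunk, int eflags)
{
	regex_t re;
	int rc;

	assert(pattern != NULL && chunk != NULL);

	if (regcomp(&re, pattern, REG_NOSUB) != 0)
		return (-1);
	/* eflags says whether chunk really starts and ends a line. */
	rc = regexec(&re, chunk, 0, NULL, eflags);
	regfree(&re);

	if (rc == 0)
		return (1);
	return (rc == REG_NOMATCH ? 0 : -1);
}
```

.Pp
Under these assumptions, the pattern
.Ql ^abc
matches the chunk
.Ql abcdef
with
.Fa eflags
of 0 but not with
.Dv REG_NOTBOL ,
and
.Ql def$
stops matching once
.Dv REG_NOTEOL
is given.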
.Pp
See
.Xr re_format 7
for a discussion of what is matched in situations where an RE or a
portion thereof could match any of several substrings of
.Fa string .
.Pp
Normally,
.Fn regexec
returns 0 for success and the non-zero code
.Dv REG_NOMATCH
for failure.
Other non-zero error codes may be returned in exceptional situations;
see
.Sx DIAGNOSTICS .
.Pp
If
.Dv REG_NOSUB
was specified in the compilation of the RE,
or if
.Fa nmatch
is 0,
.Fn regexec
ignores the
.Fa pmatch
argument (but see below for the case where
.Dv REG_STARTEND
is specified).
Otherwise,
.Fa pmatch
points to an array of
.Fa nmatch
structures of type
.Ft regmatch_t .
Such a structure has at least the members
.Va rm_so
and
.Va rm_eo ,
both of type
.Ft regoff_t
(a signed arithmetic type at least as large as an
.Ft off_t
and a
.Ft ssize_t ) ,
containing respectively the offset of the first character of a substring
and the offset of the first character after the end of the substring.
Offsets are measured from the beginning of the
.Fa string
argument given to
.Fn regexec .
An empty substring is denoted by equal offsets,
both indicating the character following the empty substring.
.Pp
The 0th member of the
.Fa pmatch
array is filled in to indicate what substring of
.Fa string
was matched by the entire RE.
Remaining members report what substring was matched by parenthesized
subexpressions within the RE;
member
.Va i
reports subexpression
.Va i ,
with subexpressions counted (starting at 1) by the order of their opening
parentheses in the RE, left to right.
Unused entries in the array (corresponding either to subexpressions that
did not participate in the match at all, or to subexpressions that do not
exist in the RE (that is,
.Va i
>
.Fa preg Ns -> Ns Va re_nsub ) )
have both
.Va rm_so
and
.Va rm_eo
set to -1.
If a subexpression participated in the match several times,
the reported substring is the last one it matched.
(Note, in particular, that when the RE
.Ql "(b*)+"
matches
.Ql bbb ,
the parenthesized subexpression matches each of the three
.So Li b Sc Ns s
and then
an infinite number of empty strings following the last
.Ql b ,
so the reported substring is one of the empties.)
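.Pp
The offsets in
.Fa pmatch
can be turned into substrings as in the following minimal sketch.
The helper name
.Fn dup_group
and the limit
.Dv MAX_SLOTS
are hypothetical choices made for this example.

```c
#include <assert.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

#define MAX_SLOTS 10	/* arbitrary limit chosen for this sketch */

/*
 * Hypothetical helper: match the extended RE `pattern` against `text`
 * and return a malloc()ed copy of the substring matched by subexpression
 * `group` (0 means the entire match).  Returns NULL if the RE does not
 * compile, does not match, or the group took no part in the match.
 */
char *
dup_group(const char *pattern, const char *text, size_t group)
{
	regex_t re;
	regmatch_t m[MAX_SLOTS];
	char *out = NULL;

	assert(pattern != NULL && text != NULL && group < MAX_SLOTS);

	if (regcomp(&re, pattern, REG_EXTENDED) != 0)
		return (NULL);
	/* Unused slots come back with rm_so and rm_eo set to -1. */
	if (regexec(&re, text, MAX_SLOTS, m, 0) == 0 &&
	    m[group].rm_so != -1) {
		size_t len = (size_t)(m[group].rm_eo - m[group].rm_so);

		out = malloc(len + 1);
		if (out != NULL) {
			/* Offsets are measured from the start of text. */
			memcpy(out, text + m[group].rm_so, len);
			out[len] = '\0';
		}
	}
	regfree(&re);
	return (out);
}
```

.Pp
The caller is responsible for passing the returned string to
.Xr free 3 .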
.Pp
If
.Dv REG_STARTEND
is specified,
.Fa pmatch
must point to at least one
.Ft regmatch_t
(even if
.Fa nmatch
is 0 or
.Dv REG_NOSUB
was specified),
to hold the input offsets for
.Dv REG_STARTEND .
Use for output is still entirely controlled by
.Fa nmatch ;
if
.Fa nmatch
is 0 or
.Dv REG_NOSUB
was specified,
the value of
.Fa pmatch Ns [0]
will not be changed by a successful
.Fn regexec .
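.Pp
One common use of
.Dv REG_STARTEND
is to find successive matches in a buffer without copying it.
The following is a minimal sketch under that assumption;
the helper name
.Fn count_matches
is hypothetical, and the code relies on the
.Dv REG_STARTEND
extension, so it is not portable to systems that lack it.

```c
#include <assert.h>
#include <regex.h>
#include <string.h>

/*
 * Hypothetical helper: count the non-overlapping matches of the extended
 * RE `pattern` in `text`, restarting each search where the previous
 * match ended.  Returns -1 if the RE does not compile.
 */
long
count_matches(const char *pattern, const char *text)
{
	regex_t re;
	regmatch_t m;
	size_t pos = 0, len;
	long n = 0;

	assert(pattern != NULL && text != NULL);

	if (regcomp(&re, pattern, REG_EXTENDED) != 0)
		return (-1);
	len = strlen(text);
	while (pos <= len) {
		/* Input: restrict the search to text[pos] .. text[len-1]. */
		m.rm_so = (regoff_t)pos;
		m.rm_eo = (regoff_t)len;
		/* After the first search, pos is no longer a line start. */
		if (regexec(&re, text, 1, &m,
		    REG_STARTEND | (pos > 0 ? REG_NOTBOL : 0)) != 0)
			break;
		n++;
		/* Output offsets are relative to text, not text + pos. */
		pos = (size_t)m.rm_eo;
		if (m.rm_eo == m.rm_so)
			pos++;	/* step past an empty match */
	}
	regfree(&re);
	return (n);
}
```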
.Pp
The
.Fn regerror
function
maps a non-zero
.Fa errcode
from either
.Fn regcomp
or
.Fn regexec
to a human-readable, printable message.
If
.Fa preg
is
.No non\- Ns Dv NULL ,
the error code should have arisen from use of
the
.Ft regex_t
pointed to by
.Fa preg ,
and if the error code came from
.Fn regcomp ,
it should have been the result from the most recent
.Fn regcomp
using that
.Ft regex_t .
.Po
.Fn regerror
may be able to supply a more detailed message using information
from the
.Ft regex_t .
.Pc
The
.Fn regerror
function
places the NUL-terminated message into the buffer pointed to by
.Fa errbuf ,
limiting the length (including the NUL) to at most
.Fa errbuf_size
bytes.
If the whole message will not fit,
as much of it as will fit before the terminating NUL is supplied.
In any case,
the returned value is the size of buffer needed to hold the whole
message (including terminating NUL).
If
.Fa errbuf_size
is 0,
.Fa errbuf
is ignored but the return value is still correct.
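.Pp
Because the return value is the required buffer size,
.Fn regerror
can be called twice: once to size the buffer and once to fill it.
The following is a minimal sketch of that idiom;
the helper name
.Fn compile_error
is hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stdlib.h>

/*
 * Hypothetical helper: return a malloc()ed message describing why
 * `pattern` fails to compile as an extended RE.  Returns NULL if the
 * pattern compiles, or if memory for the message cannot be allocated.
 */
char *
compile_error(const char *pattern)
{
	regex_t re;
	char *msg;
	size_t need;
	int rc;

	assert(pattern != NULL);

	rc = regcomp(&re, pattern, REG_EXTENDED);
	if (rc == 0) {
		regfree(&re);
		return (NULL);
	}
	/* First call: an errbuf_size of 0 only reports the size needed. */
	need = regerror(rc, &re, NULL, 0);
	msg = malloc(need);
	if (msg != NULL)
		(void)regerror(rc, &re, msg, need);	/* fill the buffer */
	return (msg);
}
```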
.Pp
If the
.Fa errcode
given to
.Fn regerror
is first ORed with
.Dv REG_ITOA ,
the
.Dq message
that results is the printable name of the error code,
e.g.\&
.Dq Dv REG_NOMATCH ,
rather than an explanation thereof.
If
.Fa errcode
is
.Dv REG_ATOI ,
then
.Fa preg
shall be
.No non\- Ns Dv NULL
and the
.Va re_endp
member of the structure it points to
must point to the printable name of an error code;
in this case, the result in
.Fa errbuf
is the decimal digits of
the numeric value of the error code
(0 if the name is not recognized).
.Dv REG_ITOA
and
.Dv REG_ATOI
are intended primarily as debugging facilities;
they are extensions,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
Be warned also that they are considered experimental and changes are possible.
.Pp
The
.Fn regfree
function
frees any dynamically-allocated storage associated with the compiled RE
pointed to by
.Fa preg .
The remaining
.Ft regex_t
is no longer a valid compiled RE
and the effect of supplying it to
.Fn regexec
or
.Fn regerror
is undefined.
.Pp
None of these functions references global variables except for tables
of constants;
all are safe for use from multiple threads if the arguments are safe.
.Sh IMPLEMENTATION CHOICES
There are a number of decisions that
.St -p1003.2
leaves up to the implementor,
either by explicitly saying
.Dq undefined
or by virtue of them being
forbidden by the RE grammar.
This implementation treats them as follows.
.Pp
See
.Xr re_format 7
for a discussion of the definition of case-independent matching.
.Pp
There is no particular limit on the length of REs,
except insofar as memory is limited.
Memory usage is approximately linear in RE size, and largely insensitive
to RE complexity, except for bounded repetitions.
See
.Sx BUGS
for one short RE using them
that will run almost any system out of memory.
.Pp
A backslashed character other than one specifically given a magic meaning
by
.St -p1003.2
(such magic meanings occur only in obsolete
.Bq Dq basic
REs)
is taken as an ordinary character.
.Pp
Any unmatched
.Ql [\&
is a
.Dv REG_EBRACK
error.
.Pp
Equivalence classes cannot begin or end bracket-expression ranges.
The endpoint of one range cannot begin another.
.Pp
.Dv RE_DUP_MAX ,
the limit on repetition counts in bounded repetitions, is 255.
.Pp
A repetition operator
.Po Ql ?\& ,
.Ql *\& ,
.Ql +\& ,
or bounds
.Pc
cannot follow another
repetition operator.
A repetition operator cannot begin an expression or subexpression
or follow
.Ql ^\&
or
.Ql |\& .
.Pp
.Ql |\&
cannot appear first or last in a (sub)expression or after another
.Ql |\& ,
i.e., an operand of
.Ql |\&
cannot be an empty subexpression.
An empty parenthesized subexpression,
.Ql "()" ,
is legal and matches an
empty (sub)string.
An empty string is not a legal RE.
.Pp
A
.Ql {\&
followed by a digit is considered the beginning of bounds for a
bounded repetition, which must then follow the syntax for bounds.
A
.Ql {\&
.Em not
followed by a digit is considered an ordinary character.
.Pp
.Ql ^\&
and
.Ql $\&
beginning and ending subexpressions in obsolete
.Pq Dq basic
REs are anchors, not ordinary characters.
.Sh DIAGNOSTICS
Non-zero error codes from
.Fn regcomp
and
.Fn regexec
include the following:
.Pp
.Bl -tag -width REG_ECOLLATE -compact
.It Dv REG_NOMATCH
The
.Fn regexec
function
failed to match
.It Dv REG_BADPAT
invalid regular expression
.It Dv REG_ECOLLATE
invalid collating element
.It Dv REG_ECTYPE
invalid character class
.It Dv REG_EESCAPE
.Ql \e
applied to unescapable character
.It Dv REG_ESUBREG
invalid backreference number
.It Dv REG_EBRACK
brackets
.Ql "[ ]"
not balanced
.It Dv REG_EPAREN
parentheses
.Ql "( )"
not balanced
.It Dv REG_EBRACE
braces
.Ql "{ }"
not balanced
.It Dv REG_BADBR
invalid repetition count(s) in
.Ql "{ }"
.It Dv REG_ERANGE
invalid character range in
.Ql "[ ]"
.It Dv REG_ESPACE
ran out of memory
.It Dv REG_BADRPT
.Ql ?\& ,
.Ql *\& ,
or
.Ql +\&
operand invalid
.It Dv REG_EMPTY
empty (sub)expression
.It Dv REG_ASSERT
cannot happen - you found a bug
.It Dv REG_INVARG
invalid argument, e.g.\& negative-length string
.It Dv REG_ILLSEQ
illegal byte sequence (bad multibyte character)
.El
.Sh SEE ALSO
.Xr grep 1 ,
.Xr re_format 7
.Pp
.St -p1003.2 ,
sections 2.8 (Regular Expression Notation)
and
B.5 (C Binding for Regular Expression Matching).
.Sh HISTORY
Originally written by
.An Henry Spencer .
Altered for inclusion in the
.Bx 4.4
distribution.
.Sh BUGS
This is an alpha release with known defects.
Please report problems.
.Pp
The back-reference code is subtle and doubts linger about its correctness
in complex cases.
.Pp
The performance of the
.Fn regexec
function is poor.
This will improve with later releases.
The
.Fa nmatch
argument
exceeding 0 is expensive;
.Fa nmatch
exceeding 1 is worse.
The
.Fn regexec
function
is largely insensitive to RE complexity
.Em except
that back
references are massively expensive.
RE length does matter; in particular, there is a strong speed bonus
for keeping RE length under about 30 characters,
with most special characters counting roughly double.
.Pp
The
.Fn regcomp
function
implements bounded repetitions by macro expansion,
which is costly in time and space if counts are large
or bounded repetitions are nested.
An RE like, say,
.Ql "((((a{1,100}){1,100}){1,100}){1,100}){1,100}"
will (eventually) run almost any existing machine out of swap space.
.Pp
There are suspected problems with response to obscure error conditions.
Notably,
certain kinds of internal overflow,
produced only by truly enormous REs or by multiply nested bounded repetitions,
are probably not handled well.
.Pp
Due to a mistake in
.St -p1003.2 ,
things like
.Ql "a)b"
are legal REs because
.Ql )\&
is
a special character only in the presence of a previous unmatched
.Ql (\& .
This cannot be fixed until the spec is fixed.
.Pp
The standard's definition of back references is vague.
For example, does
.Ql "a\e(\e(b\e)*\e2\e)*d"
match
.Ql "abbbd" ?
Until the standard is clarified,
behavior in such cases should not be relied on.
.Pp
The implementation of word-boundary matching is a bit of a kludge,
and bugs may lurk in combinations of word-boundary matching and anchoring.
.Pp
Word-boundary matching does not work properly in multibyte locales.
767