.\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
.\" Copyright (c) 1992, 1993, 1994
.\"	The Regents of the University of California.  All rights reserved.
.\"
.\" This code is derived from software contributed to Berkeley by
.\" Henry Spencer.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.Dd April 15, 2017
.Dt REGEX 3
.Os
.Sh NAME
.Nm regcomp ,
.Nm regexec ,
.Nm regerror ,
.Nm regfree
.Nd regular-expression library
.Sh LIBRARY
.Lb libc
.Sh SYNOPSIS
.In regex.h
.Ft int
.Fo regcomp
.Fa "regex_t * restrict preg" "const char * restrict pattern" "int cflags"
.Fc
.Ft int
.Fo regexec
.Fa "const regex_t * restrict preg" "const char * restrict string"
.Fa "size_t nmatch" "regmatch_t pmatch[restrict]" "int eflags"
.Fc
.Ft size_t
.Fo regerror
.Fa "int errcode" "const regex_t * restrict preg"
.Fa "char * restrict errbuf" "size_t errbuf_size"
.Fc
.Ft void
.Fn regfree "regex_t *preg"
.Sh DESCRIPTION
These routines implement
.St -p1003.2
regular expressions
.Pq Do RE Dc Ns s ;
see
.Xr re_format 7 .
The
.Fn regcomp
function
compiles an RE written as a string into an internal form,
.Fn regexec
matches that internal form against a string and reports results,
.Fn regerror
transforms error codes from either into human-readable messages,
and
.Fn regfree
frees any dynamically-allocated storage used by the internal form
of an RE.
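.Pp
The way the four functions fit together can be sketched as follows.
This is a minimal illustration rather than part of the library:
.Fn re_contains
is a hypothetical helper name, and the use of
.Dv REG_EXTENDED
and
.Dv REG_NOSUB
is an assumption made for the example.
.Bd -literal -offset indent
```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: report whether text contains a match for the
 * extended RE in pattern.  Returns 1 on a match and 0 on no match.
 * On any error it returns -1 and stores a message in errbuf.
 */
int
re_contains(const char *pattern, const char *text,
    char *errbuf, size_t errbuf_size)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);

    /* Compile; REG_NOSUB because only success or failure is wanted. */
    rc = regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB);
    if (rc != 0) {
        regerror(rc, &re, errbuf, errbuf_size);
        return (-1);
    }

    /* Match; pmatch is ignored when nmatch is 0. */
    rc = regexec(&re, text, 0, NULL, 0);
    if (rc != 0 && rc != REG_NOMATCH)
        regerror(rc, &re, errbuf, errbuf_size);

    /* Free the storage owned by the compiled form. */
    regfree(&re);

    if (rc == 0)
        return (1);
    return (rc == REG_NOMATCH ? 0 : -1);
}
```
.Ed
.Pp
With the pattern
.Ql ab+c ,
the helper returns 1 for the text
.Ql xxabbbcxx
and 0 for the text
.Ql ac ;
the malformed pattern
.Ql a(b
yields -1 and a message in
.Fa errbuf .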
.Pp
The header
.In regex.h
declares two structure types,
.Ft regex_t
and
.Ft regmatch_t ,
the former for compiled internal forms and the latter for match reporting.
It also declares the four functions,
a type
.Ft regoff_t ,
and a number of constants with names starting with
.Dq Dv REG_ .
.Pp
The
.Fn regcomp
function
compiles the regular expression contained in the
.Fa pattern
string,
subject to the flags in
.Fa cflags ,
and places the results in the
.Ft regex_t
structure pointed to by
.Fa preg .
The
.Fa cflags
argument
is the bitwise OR of zero or more of the following flags:
.Bl -tag -width REG_EXTENDED
.It Dv REG_EXTENDED
Compile modern
.Pq Dq extended
REs,
rather than the obsolete
.Pq Dq basic
REs that
are the default.
.It Dv REG_BASIC
This is a synonym for 0,
provided as a counterpart to
.Dv REG_EXTENDED
to improve readability.
.It Dv REG_NOSPEC
Compile with recognition of all special characters turned off.
All characters are thus considered ordinary,
so the
.Dq RE
is a literal string.
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
.Dv REG_EXTENDED
and
.Dv REG_NOSPEC
may not be used
in the same call to
.Fn regcomp .
.It Dv REG_ICASE
Compile for matching that ignores upper/lower case distinctions.
See
.Xr re_format 7 .
.It Dv REG_NOSUB
Compile for matching that need only report success or failure,
not what was matched.
.It Dv REG_NEWLINE
Compile for newline-sensitive matching.
By default, newline is a completely ordinary character with no special
meaning in either REs or strings.
With this flag,
.Ql [^
bracket expressions and
.Ql .\&
never match newline,
a
.Ql ^\&
anchor matches the null string after any newline in the string
in addition to its normal function,
and the
.Ql $\&
anchor matches the null string before any newline in the
string in addition to its normal function.
.It Dv REG_PEND
The regular expression ends,
not at the first NUL,
but just before the character pointed to by the
.Va re_endp
member of the structure pointed to by
.Fa preg .
The
.Va re_endp
member is of type
.Ft "const char *" .
This flag permits inclusion of NULs in the RE;
they are considered ordinary characters.
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
.It Dv REG_POSIX
Compile only
.St -p1003.2
compliant expressions.
This flag has no effect unless linking against
.Nm libregex .
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
.El
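.Pp
As an illustration of how
.Fa cflags
values are combined, the following sketch toggles
.Dv REG_NEWLINE
on top of
.Dv REG_EXTENDED
and
.Dv REG_NOSUB .
The helper name
.Fn re_match_lines
and its
.Fa newline_sensitive
parameter are hypothetical and exist only for this example.
.Bd -literal -offset indent
```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: match an extended RE against text, optionally
 * with newline-sensitive matching.  Returns 1 on a match, 0 on no
 * match, and -1 on any error.
 */
int
re_match_lines(const char *pattern, const char *text, int newline_sensitive)
{
    regex_t re;
    int cflags = REG_EXTENDED | REG_NOSUB;
    int rc;

    assert(pattern != NULL && text != NULL);

    /* Flags are single bits, so they are combined with bitwise OR. */
    if (newline_sensitive)
        cflags |= REG_NEWLINE;
    if (regcomp(&re, pattern, cflags) != 0)
        return (-1);
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    if (rc == 0)
        return (1);
    return (rc == REG_NOMATCH ? 0 : -1);
}
```
.Ed
.Pp
For a three-line text whose middle line is
.Ql b ,
the pattern
.Ql ^b$
is expected to match only when
.Fa newline_sensitive
is non-zero, because otherwise the anchors apply only to the ends of
the whole string.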
.Pp
When successful,
.Fn regcomp
returns 0 and fills in the structure pointed to by
.Fa preg .
One member of that structure
(other than
.Va re_endp )
is publicized:
.Va re_nsub ,
of type
.Ft size_t ,
contains the number of parenthesized subexpressions within the RE
(except that the value of this member is undefined if the
.Dv REG_NOSUB
flag was used).
If
.Fn regcomp
fails, it returns a non-zero error code;
see
.Sx DIAGNOSTICS .
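.Pp
Reading
.Va re_nsub
after a successful compilation might look like the sketch below.
The helper name
.Fn re_group_count
is hypothetical.
.Bd -literal -offset indent
```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: return the number of parenthesized
 * subexpressions in an extended RE, or -1 if it does not compile.
 * REG_NOSUB is deliberately not used, since re_nsub is undefined
 * with that flag.
 */
long
re_group_count(const char *pattern)
{
    regex_t re;
    long n;

    assert(pattern != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return (-1);
    n = (long)re.re_nsub;   /* read before the RE is freed */
    regfree(&re);
    return (n);
}
```
.Ed
.Pp
A caller that wants every subexpression reported can size its
.Ft regmatch_t
array as
.Va re_nsub
+ 1, leaving room for the whole-match entry.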
.Pp
The
.Fn regexec
function
matches the compiled RE pointed to by
.Fa preg
against the
.Fa string ,
subject to the flags in
.Fa eflags ,
and reports results using
.Fa nmatch ,
.Fa pmatch ,
and the returned value.
The RE must have been compiled by a previous invocation of
.Fn regcomp .
The compiled form is not altered during execution of
.Fn regexec ,
so a single compiled RE can be used simultaneously by multiple threads.
.Pp
By default,
the NUL-terminated string pointed to by
.Fa string
is considered to be the text of an entire line, minus any terminating
newline.
The
.Fa eflags
argument is the bitwise OR of zero or more of the following flags:
.Bl -tag -width REG_STARTEND
.It Dv REG_NOTBOL
The first character of the string is treated as the continuation
of a line.
This means that the anchors
.Ql ^\& ,
.Ql [[:<:]] ,
and
.Ql \e<
do not match before it; but see
.Dv REG_STARTEND
below.
This does not affect the behavior of newlines under
.Dv REG_NEWLINE .
.It Dv REG_NOTEOL
The NUL terminating
the string
does not end a line, so the
.Ql $\&
anchor does not match before it.
This does not affect the behavior of newlines under
.Dv REG_NEWLINE .
.It Dv REG_STARTEND
The string is considered to start at
.Fa string No +
.Fa pmatch Ns [0]. Ns Fa rm_so
and to end before the byte located at
.Fa string No +
.Fa pmatch Ns [0]. Ns Fa rm_eo ,
regardless of the value of
.Fa nmatch .
See below for the definition of
.Fa pmatch
and
.Fa nmatch .
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
.Pp
Without
.Dv REG_NOTBOL ,
the position
.Fa rm_so
is considered the beginning of a line, such that
.Ql ^
matches before it, and the beginning of a word if there is a word
character at this position, such that
.Ql [[:<:]]
and
.Ql \e<
match before it.
.Pp
With
.Dv REG_NOTBOL ,
the character at position
.Fa rm_so
is treated as the continuation of a line, and if
.Fa rm_so
is greater than 0, the preceding character is taken into consideration.
If the preceding character is a newline and the regular expression was compiled
with
.Dv REG_NEWLINE ,
.Ql ^
matches before the string; if the preceding character is not a word character
but the string starts with a word character,
.Ql [[:<:]]
and
.Ql \e<
match before the string.
.El
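.Pp
The input convention of
.Dv REG_STARTEND
can be sketched as follows.
The helper name
.Fn re_search_range
and its return codes are hypothetical; because the flag is an extension,
the sketch falls back to a distinct return value where
.Dv REG_STARTEND
is not defined.
.Bd -literal -offset indent
```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: search only the bytes [start, end) of text.
 * On a match, stores the offsets (relative to text) in *so and *eo
 * and returns 1.  Returns 0 on no match, -1 on error, and -2 if the
 * REG_STARTEND extension is not available on this system.
 */
int
re_search_range(const char *pattern, const char *text,
    size_t start, size_t end, size_t *so, size_t *eo)
{
#ifdef REG_STARTEND
    regex_t re;
    regmatch_t pm[1];
    int rc;

    assert(pattern != NULL && text != NULL && start <= end);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return (-1);

    /* With REG_STARTEND, pm[0] carries the range in as input. */
    pm[0].rm_so = (regoff_t)start;
    pm[0].rm_eo = (regoff_t)end;
    rc = regexec(&re, text, 1, pm, REG_STARTEND);
    regfree(&re);
    if (rc == REG_NOMATCH)
        return (0);
    if (rc != 0)
        return (-1);

    /* On success, pm[0] has been overwritten with the match. */
    *so = (size_t)pm[0].rm_so;
    *eo = (size_t)pm[0].rm_eo;
    return (1);
#else
    (void)pattern; (void)text; (void)start; (void)end;
    (void)so; (void)eo;
    return (-2);
#endif
}
```
.Ed
.Pp
The sketch avoids anchors on purpose, so that it exercises only the
range restriction itself.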
.Pp
See
.Xr re_format 7
for a discussion of what is matched in situations where an RE or a
portion thereof could match any of several substrings of
.Fa string .
.Pp
Normally,
.Fn regexec
returns 0 for success and the non-zero code
.Dv REG_NOMATCH
for failure.
Other non-zero error codes may be returned in exceptional situations;
see
.Sx DIAGNOSTICS .
.Pp
If
.Dv REG_NOSUB
was specified in the compilation of the RE,
or if
.Fa nmatch
is 0,
.Fn regexec
ignores the
.Fa pmatch
argument (but see below for the case where
.Dv REG_STARTEND
is specified).
Otherwise,
.Fa pmatch
points to an array of
.Fa nmatch
structures of type
.Ft regmatch_t .
Such a structure has at least the members
.Va rm_so
and
.Va rm_eo ,
both of type
.Ft regoff_t
(a signed arithmetic type at least as large as an
.Ft off_t
and a
.Ft ssize_t ) ,
containing respectively the offset of the first character of a substring
and the offset of the first character after the end of the substring.
Offsets are measured from the beginning of the
.Fa string
argument given to
.Fn regexec .
An empty substring is denoted by equal offsets,
both indicating the character following the empty substring.
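.Pp
Because offsets are relative to the
.Fa string
argument, a loop over successive matches has to add them to a moving
pointer.
The sketch below shows one way to do that; the helper name
.Fn re_count_matches
is hypothetical, and stepping one byte past an empty match assumes a
single-byte character encoding.
.Bd -literal -offset indent
```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: count the non-overlapping matches of an
 * extended RE in text, scanning left to right.  Returns -1 on error.
 */
long
re_count_matches(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t pm[1];
    const char *p = text;
    long count = 0;
    int eflags = 0;
    int rc;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return (-1);
    while ((rc = regexec(&re, p, 1, pm, eflags)) == 0) {
        count++;
        /* Offsets are relative to p, the string given to regexec. */
        if (pm[0].rm_eo > pm[0].rm_so) {
            p += pm[0].rm_eo;
        } else {
            /* Empty match: step one byte to make progress. */
            if (p[pm[0].rm_eo] == 0)
                break;
            p += pm[0].rm_eo + 1;
        }
        /* Later calls do not start at the beginning of a line. */
        eflags = REG_NOTBOL;
    }
    regfree(&re);
    return ((rc == 0 || rc == REG_NOMATCH) ? count : -1);
}
```
.Ed
.Pp
Passing
.Dv REG_NOTBOL
on every call after the first keeps
.Ql ^
from matching at each restart position.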
.Pp
The 0th member of the
.Fa pmatch
array is filled in to indicate what substring of
.Fa string
was matched by the entire RE.
Remaining members report what substring was matched by parenthesized
subexpressions within the RE;
member
.Va i
reports subexpression
.Va i ,
with subexpressions counted (starting at 1) by the order of their opening
parentheses in the RE, left to right.
Unused entries in the array (corresponding either to subexpressions that
did not participate in the match at all, or to subexpressions that do not
exist in the RE (that is,
.Va i
>
.Fa preg Ns -> Ns Va re_nsub ) )
have both
.Va rm_so
and
.Va rm_eo
set to -1.
If a subexpression participated in the match several times,
the reported substring is the last one it matched.
(Note, as an example in particular, that when the RE
.Ql "(b*)+"
matches
.Ql bbb ,
the parenthesized subexpression matches each of the three
.So Li b Sc Ns s
and then
an infinite number of empty strings following the last
.Ql b ,
so the reported substring is one of the empties.)
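.Pp
Reading subexpression offsets out of
.Fa pmatch
might look like the following sketch.
The helper name
.Fn re_extract ,
the
.Dv MAX_GROUPS
limit, and the return codes are hypothetical choices made for the example.
.Bd -literal -offset indent
```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

#define MAX_GROUPS 8    /* illustrative limit, not part of the API */

/*
 * Hypothetical helper: copy the text matched by subexpression
 * number group (0 is the whole match) into buf.  Returns the length
 * copied, -1 if there is no match or the group did not participate,
 * and -2 on error or if buf is too small.
 */
long
re_extract(const char *pattern, const char *text, size_t group,
    char *buf, size_t bufsize)
{
    regex_t re;
    regmatch_t pm[MAX_GROUPS];
    size_t len;
    int rc;

    assert(pattern != NULL && text != NULL && buf != NULL);
    if (group >= MAX_GROUPS)
        return (-2);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return (-2);
    rc = regexec(&re, text, MAX_GROUPS, pm, 0);
    regfree(&re);
    if (rc == REG_NOMATCH)
        return (-1);
    if (rc != 0)
        return (-2);

    /* Unused entries have both offsets set to -1. */
    if (pm[group].rm_so == -1)
        return (-1);
    len = (size_t)(pm[group].rm_eo - pm[group].rm_so);
    if (len + 1 > bufsize)
        return (-2);
    memcpy(buf, text + pm[group].rm_so, len);
    buf[len] = 0;
    return ((long)len);
}
```
.Ed
.Pp
With the pattern
.Ql (a)|(b)
and the text
.Ql b ,
subexpression 1 does not participate in the match, so the helper
returns -1 for it.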
.Pp
If
.Dv REG_STARTEND
is specified,
.Fa pmatch
must point to at least one
.Ft regmatch_t
(even if
.Fa nmatch
is 0 or
.Dv REG_NOSUB
was specified),
to hold the input offsets for
.Dv REG_STARTEND .
Use for output is still entirely controlled by
.Fa nmatch ;
if
.Fa nmatch
is 0 or
.Dv REG_NOSUB
was specified,
the value of
.Fa pmatch Ns [0]
will not be changed by a successful
.Fn regexec .
.Pp
The
.Fn regerror
function
maps a non-zero
.Fa errcode
from either
.Fn regcomp
or
.Fn regexec
to a human-readable, printable message.
If
.Fa preg
is
.No non\- Ns Dv NULL ,
the error code should have arisen from use of
the
.Ft regex_t
pointed to by
.Fa preg ,
and if the error code came from
.Fn regcomp ,
it should have been the result from the most recent
.Fn regcomp
using that
.Ft regex_t .
.Po
The
.Fn regerror
function may be able to supply a more detailed message using information
from the
.Ft regex_t .
.Pc
The
.Fn regerror
function
places the NUL-terminated message into the buffer pointed to by
.Fa errbuf ,
limiting the length (including the NUL) to at most
.Fa errbuf_size
bytes.
If the whole message will not fit,
as much of it as will fit before the terminating NUL is supplied.
In any case,
the returned value is the size of buffer needed to hold the whole
message (including terminating NUL).
If
.Fa errbuf_size
is 0,
.Fa errbuf
is ignored but the return value is still correct.
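.Pp
The return value makes a two-call idiom possible: ask for the size
first, then allocate and fetch the message.
The following is a sketch of that idiom; the helper name
.Fn re_compile_error
and its
.Fa needed
parameter are hypothetical.
.Bd -literal -offset indent
```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical helper: try to compile an extended RE.  Returns NULL
 * if it compiles, otherwise a malloc()ed error message that the
 * caller must free().  (For brevity, NULL is also returned if
 * malloc() fails.)  If needed is non-NULL, the buffer size reported
 * by regerror() is stored there.
 */
char *
re_compile_error(const char *pattern, size_t *needed)
{
    regex_t re;
    char *msg;
    size_t size;
    int rc;

    assert(pattern != NULL);
    rc = regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB);
    if (rc == 0) {
        regfree(&re);
        return (NULL);
    }

    /* First call: errbuf_size is 0, so only the size is returned. */
    size = regerror(rc, &re, NULL, 0);
    if (needed != NULL)
        *needed = size;

    /* Second call: the whole message, including the NUL, fits. */
    msg = malloc(size);
    if (msg != NULL)
        (void)regerror(rc, &re, msg, size);
    return (msg);
}
```
.Ed
.Pp
A fixed-size buffer is often sufficient in practice; the two-call form
is useful when truncated messages are not acceptable.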
.Pp
If the
.Fa errcode
given to
.Fn regerror
is first ORed with
.Dv REG_ITOA ,
the
.Dq message
that results is the printable name of the error code,
e.g.\&
.Dq Dv REG_NOMATCH ,
rather than an explanation thereof.
If
.Fa errcode
is
.Dv REG_ATOI ,
then
.Fa preg
shall be
.No non\- Ns Dv NULL
and the
.Va re_endp
member of the structure it points to
must point to the printable name of an error code;
in this case, the result in
.Fa errbuf
is the decimal digits of
the numeric value of the error code
(0 if the name is not recognized).
.Dv REG_ITOA
and
.Dv REG_ATOI
are intended primarily as debugging facilities;
they are extensions,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
Be warned also that they are considered experimental and changes are possible.
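.Pp
A guarded use of
.Dv REG_ITOA
might look like the sketch below.
The helper name
.Fn re_error_name
is hypothetical, and passing a
.Dv NULL
.Fa preg
is an assumption based on the description above, which constrains
.Fa preg
only when it is
.No non\- Ns Dv NULL .
.Bd -literal -offset indent
```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: store the symbolic name of an error code,
 * such as REG_NOMATCH, in buf.  Returns 1 on success and 0 if the
 * REG_ITOA extension is not available on this system.
 */
int
re_error_name(int errcode, char *buf, size_t bufsize)
{
    assert(buf != NULL && bufsize > 0);
#ifdef REG_ITOA
    /* Assumption: no compiled RE is needed for the name lookup. */
    (void)regerror(REG_ITOA | errcode, NULL, buf, bufsize);
    return (1);
#else
    (void)errcode;
    buf[0] = 0;
    return (0);
#endif
}
```
.Ed
.Pp
The preprocessor guard keeps the code buildable on systems that do not
provide the extension.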
.Pp
The
.Fn regfree
function
frees any dynamically-allocated storage associated with the compiled RE
pointed to by
.Fa preg .
The remaining
.Ft regex_t
is no longer a valid compiled RE
and the effect of supplying it to
.Fn regexec
or
.Fn regerror
is undefined.
.Pp
None of these functions references global variables except for tables
of constants;
all are safe for use from multiple threads if the arguments are safe.
.Sh IMPLEMENTATION CHOICES
There are a number of decisions that
.St -p1003.2
leaves up to the implementor,
either by explicitly saying
.Dq undefined
or by virtue of them being
forbidden by the RE grammar.
This implementation treats them as follows.
.Pp
See
.Xr re_format 7
for a discussion of the definition of case-independent matching.
.Pp
There is no particular limit on the length of REs,
except insofar as memory is limited.
Memory usage is approximately linear in RE size, and largely insensitive
to RE complexity, except for bounded repetitions.
See
.Sx BUGS
for one short RE using them
that will run almost any system out of memory.
.Pp
A backslashed character other than one specifically given a magic meaning
by
.St -p1003.2
(such magic meanings occur only in obsolete
.Bq Dq basic
REs)
is taken as an ordinary character.
.Pp
Any unmatched
.Ql [\&
is a
.Dv REG_EBRACK
error.
.Pp
Equivalence classes cannot begin or end bracket-expression ranges.
The endpoint of one range cannot begin another.
.Pp
.Dv RE_DUP_MAX ,
the limit on repetition counts in bounded repetitions, is 255.
.Pp
A repetition operator
.Ql ( ?\& ,
.Ql *\& ,
.Ql +\& ,
or bounds)
cannot follow another
repetition operator.
A repetition operator cannot begin an expression or subexpression
or follow
.Ql ^\&
or
.Ql |\& .
.Pp
.Ql |\&
cannot appear first or last in a (sub)expression or after another
.Ql |\& ,
i.e., an operand of
.Ql |\&
cannot be an empty subexpression.
An empty parenthesized subexpression,
.Ql "()" ,
is legal and matches an
empty (sub)string.
An empty string is not a legal RE.
.Pp
A
.Ql {\&
followed by a digit is considered the beginning of bounds for a
bounded repetition, which must then follow the syntax for bounds.
A
.Ql {\&
.Em not
followed by a digit is considered an ordinary character.
.Pp
.Ql ^\&
and
.Ql $\&
beginning and ending subexpressions in obsolete
.Pq Dq basic
REs are anchors, not ordinary characters.
.Sh DIAGNOSTICS
Non-zero error codes from
.Fn regcomp
and
.Fn regexec
include the following:
.Pp
.Bl -tag -width REG_ECOLLATE -compact
.It Dv REG_NOMATCH
The
.Fn regexec
function
failed to match
.It Dv REG_BADPAT
invalid regular expression
.It Dv REG_ECOLLATE
invalid collating element
.It Dv REG_ECTYPE
invalid character class
.It Dv REG_EESCAPE
.Ql \e
applied to unescapable character
.It Dv REG_ESUBREG
invalid backreference number
.It Dv REG_EBRACK
brackets
.Ql "[ ]"
not balanced
.It Dv REG_EPAREN
parentheses
.Ql "( )"
not balanced
.It Dv REG_EBRACE
braces
.Ql "{ }"
not balanced
.It Dv REG_BADBR
invalid repetition count(s) in
.Ql "{ }"
.It Dv REG_ERANGE
invalid character range in
.Ql "[ ]"
.It Dv REG_ESPACE
ran out of memory
.It Dv REG_BADRPT
.Ql ?\& ,
.Ql *\& ,
or
.Ql +\&
operand invalid
.It Dv REG_EMPTY
empty (sub)expression
.It Dv REG_ASSERT
cannot happen - you found a bug
.It Dv REG_INVARG
invalid argument, e.g.\& negative-length string
.It Dv REG_ILLSEQ
illegal byte sequence (bad multibyte character)
.El
.Sh SEE ALSO
.Xr grep 1 ,
.Xr re_format 7
.Pp
.St -p1003.2 ,
sections 2.8 (Regular Expression Notation)
and
B.5 (C Binding for Regular Expression Matching).
.Sh HISTORY
Originally written by
.An Henry Spencer .
Altered for inclusion in the
.Bx 4.4
distribution.
.Sh BUGS
This is an alpha release with known defects.
Please report problems.
.Pp
The back-reference code is subtle and doubts linger about its correctness
in complex cases.
.Pp
The performance of the
.Fn regexec
function
is poor.
This will improve with later releases.
The
.Fa nmatch
argument
exceeding 0 is expensive;
.Fa nmatch
exceeding 1 is worse.
The
.Fn regexec
function
is largely insensitive to RE complexity
.Em except
that back
references are massively expensive.
RE length does matter; in particular, there is a strong speed bonus
for keeping RE length under about 30 characters,
with most special characters counting roughly double.
.Pp
The
.Fn regcomp
function
implements bounded repetitions by macro expansion,
which is costly in time and space if counts are large
or bounded repetitions are nested.
An RE like, say,
.Ql "((((a{1,100}){1,100}){1,100}){1,100}){1,100}"
will (eventually) run almost any existing machine out of swap space.
.Pp
There are suspected problems with response to obscure error conditions.
Notably,
certain kinds of internal overflow,
produced only by truly enormous REs or by multiply nested bounded repetitions,
are probably not handled well.
.Pp
Due to a mistake in
.St -p1003.2 ,
things like
.Ql "a)b"
are legal REs because
.Ql )\&
is
a special character only in the presence of a previous unmatched
.Ql (\& .
This cannot be fixed until the spec is fixed.
.Pp
The standard's definition of back references is vague.
For example, does
.Ql "a\e(\e(b\e)*\e2\e)*d"
match
.Ql "abbbd" ?
Until the standard is clarified,
behavior in such cases should not be relied on.
.Pp
The implementation of word-boundary matching is a bit of a kludge,
and bugs may lurk in combinations of word-boundary matching and anchoring.
.Pp
Word-boundary matching does not work properly in multibyte locales.
763Word-boundary matching does not work properly in multibyte locales.
764