xref: /freebsd/lib/libc/regex/regex.3 (revision 0957b409)
1.\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
2.\" Copyright (c) 1992, 1993, 1994
3.\"	The Regents of the University of California.  All rights reserved.
4.\"
5.\" This code is derived from software contributed to Berkeley by
6.\" Henry Spencer.
7.\"
8.\" Redistribution and use in source and binary forms, with or without
9.\" modification, are permitted provided that the following conditions
10.\" are met:
11.\" 1. Redistributions of source code must retain the above copyright
12.\"    notice, this list of conditions and the following disclaimer.
13.\" 2. Redistributions in binary form must reproduce the above copyright
14.\"    notice, this list of conditions and the following disclaimer in the
15.\"    documentation and/or other materials provided with the distribution.
16.\" 3. Neither the name of the University nor the names of its contributors
17.\"    may be used to endorse or promote products derived from this software
18.\"    without specific prior written permission.
19.\"
20.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
21.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
22.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
23.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
24.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
25.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
26.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
27.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
28.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
29.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
30.\" SUCH DAMAGE.
31.\"
32.\"	@(#)regex.3	8.4 (Berkeley) 3/20/94
33.\" $FreeBSD$
34.\"
35.Dd May 25, 2016
36.Dt REGEX 3
37.Os
38.Sh NAME
39.Nm regcomp ,
40.Nm regexec ,
41.Nm regerror ,
42.Nm regfree
43.Nd regular-expression library
44.Sh LIBRARY
45.Lb libc
46.Sh SYNOPSIS
47.In regex.h
48.Ft int
49.Fo regcomp
50.Fa "regex_t * restrict preg" "const char * restrict pattern" "int cflags"
51.Fc
52.Ft int
53.Fo regexec
54.Fa "const regex_t * restrict preg" "const char * restrict string"
55.Fa "size_t nmatch" "regmatch_t pmatch[restrict]" "int eflags"
56.Fc
57.Ft size_t
58.Fo regerror
59.Fa "int errcode" "const regex_t * restrict preg"
60.Fa "char * restrict errbuf" "size_t errbuf_size"
61.Fc
62.Ft void
63.Fn regfree "regex_t *preg"
64.Sh DESCRIPTION
65These routines implement
66.St -p1003.2
67regular expressions
68.Pq Do RE Dc Ns s ;
69see
70.Xr re_format 7 .
71The
72.Fn regcomp
73function
74compiles an RE written as a string into an internal form,
75.Fn regexec
76matches that internal form against a string and reports results,
77.Fn regerror
78transforms error codes from either into human-readable messages,
79and
80.Fn regfree
81frees any dynamically-allocated storage used by the internal form
82of an RE.
83.Pp
84The header
85.In regex.h
86declares two structure types,
87.Ft regex_t
88and
89.Ft regmatch_t ,
90the former for compiled internal forms and the latter for match reporting.
91It also declares the four functions,
92a type
93.Ft regoff_t ,
94and a number of constants with names starting with
95.Dq Dv REG_ .
96.Pp
97The
98.Fn regcomp
99function
100compiles the regular expression contained in the
101.Fa pattern
102string,
103subject to the flags in
104.Fa cflags ,
105and places the results in the
106.Ft regex_t
107structure pointed to by
108.Fa preg .
109The
110.Fa cflags
111argument
112is the bitwise OR of zero or more of the following flags:
113.Bl -tag -width REG_EXTENDED
114.It Dv REG_EXTENDED
115Compile modern
116.Pq Dq extended
117REs,
118rather than the obsolete
119.Pq Dq basic
120REs that
121are the default.
122.It Dv REG_BASIC
123This is a synonym for 0,
124provided as a counterpart to
125.Dv REG_EXTENDED
126to improve readability.
127.It Dv REG_NOSPEC
128Compile with recognition of all special characters turned off.
129All characters are thus considered ordinary,
130so the
131.Dq RE
132is a literal string.
133This is an extension,
134compatible with but not specified by
135.St -p1003.2 ,
136and should be used with
137caution in software intended to be portable to other systems.
138.Dv REG_EXTENDED
139and
140.Dv REG_NOSPEC
141may not be used
142in the same call to
143.Fn regcomp .
144.It Dv REG_ICASE
145Compile for matching that ignores upper/lower case distinctions.
146See
147.Xr re_format 7 .
148.It Dv REG_NOSUB
149Compile for matching that need only report success or failure,
150not what was matched.
151.It Dv REG_NEWLINE
152Compile for newline-sensitive matching.
153By default, newline is a completely ordinary character with no special
154meaning in either REs or strings.
155With this flag,
156.Ql [^
157bracket expressions and
158.Ql .\&
159never match newline,
160a
161.Ql ^\&
162anchor matches the null string after any newline in the string
163in addition to its normal function,
164and the
165.Ql $\&
166anchor matches the null string before any newline in the
167string in addition to its normal function.
168.It Dv REG_PEND
169The regular expression ends,
170not at the first NUL,
171but just before the character pointed to by the
172.Va re_endp
173member of the structure pointed to by
174.Fa preg .
175The
176.Va re_endp
177member is of type
178.Ft "const char *" .
179This flag permits inclusion of NULs in the RE;
180they are considered ordinary characters.
181This is an extension,
182compatible with but not specified by
183.St -p1003.2 ,
184and should be used with
185caution in software intended to be portable to other systems.
186.El
187.Pp
188When successful,
189.Fn regcomp
190returns 0 and fills in the structure pointed to by
191.Fa preg .
192One member of that structure
193(other than
194.Va re_endp )
195is publicized:
196.Va re_nsub ,
197of type
198.Ft size_t ,
199contains the number of parenthesized subexpressions within the RE
200(except that the value of this member is undefined if the
201.Dv REG_NOSUB
202flag was used).
203If
204.Fn regcomp
205fails, it returns a non-zero error code;
206see
207.Sx DIAGNOSTICS .
208.Pp
209The
210.Fn regexec
211function
212matches the compiled RE pointed to by
213.Fa preg
214against the
215.Fa string ,
216subject to the flags in
217.Fa eflags ,
218and reports results using
219.Fa nmatch ,
220.Fa pmatch ,
221and the returned value.
222The RE must have been compiled by a previous invocation of
223.Fn regcomp .
224The compiled form is not altered during execution of
225.Fn regexec ,
226so a single compiled RE can be used simultaneously by multiple threads.
227.Pp
228By default,
229the NUL-terminated string pointed to by
230.Fa string
231is considered to be the text of an entire line, minus any terminating
232newline.
233The
234.Fa eflags
235argument is the bitwise OR of zero or more of the following flags:
236.Bl -tag -width REG_STARTEND
237.It Dv REG_NOTBOL
238The first character of the string is treated as the continuation
239of a line.
240This means that the anchors
241.Ql ^\& ,
242.Ql [[:<:]] ,
243and
244.Ql \e<
245do not match before it; but see
246.Dv REG_STARTEND
247below.
248This does not affect the behavior of newlines under
249.Dv REG_NEWLINE .
250.It Dv REG_NOTEOL
251The NUL terminating
252the string
253does not end a line, so the
254.Ql $\&
255anchor does not match before it.
256This does not affect the behavior of newlines under
257.Dv REG_NEWLINE .
258.It Dv REG_STARTEND
259The string is considered to start at
260.Fa string No +
261.Fa pmatch Ns [0]. Ns Fa rm_so
262and to end before the byte located at
263.Fa string No +
264.Fa pmatch Ns [0]. Ns Fa rm_eo ,
265regardless of the value of
266.Fa nmatch .
267See below for the definition of
268.Fa pmatch
269and
270.Fa nmatch .
271This is an extension,
272compatible with but not specified by
273.St -p1003.2 ,
274and should be used with
275caution in software intended to be portable to other systems.
276.Pp
277Without
278.Dv REG_NOTBOL ,
279the position
280.Fa rm_so
281is considered the beginning of a line, such that
282.Ql ^
283matches before it, and the beginning of a word if there is a word
284character at this position, such that
285.Ql [[:<:]]
286and
287.Ql \e<
288match before it.
289.Pp
290With
291.Dv REG_NOTBOL ,
292the character at position
293.Fa rm_so
294is treated as the continuation of a line, and if
295.Fa rm_so
296is greater than 0, the preceding character is taken into consideration.
297If the preceding character is a newline and the regular expression was compiled
298with
299.Dv REG_NEWLINE ,
300.Ql ^
301matches before the string; if the preceding character is not a word character
302but the string starts with a word character,
303.Ql [[:<:]]
304and
305.Ql \e<
306match before the string.
307.El
308.Pp
309See
310.Xr re_format 7
311for a discussion of what is matched in situations where an RE or a
312portion thereof could match any of several substrings of
313.Fa string .
314.Pp
315Normally,
316.Fn regexec
317returns 0 for success and the non-zero code
318.Dv REG_NOMATCH
319for failure.
320Other non-zero error codes may be returned in exceptional situations;
321see
322.Sx DIAGNOSTICS .
323.Pp
324If
325.Dv REG_NOSUB
326was specified in the compilation of the RE,
327or if
328.Fa nmatch
329is 0,
330.Fn regexec
331ignores the
332.Fa pmatch
333argument (but see below for the case where
334.Dv REG_STARTEND
335is specified).
336Otherwise,
337.Fa pmatch
338points to an array of
339.Fa nmatch
340structures of type
341.Ft regmatch_t .
342Such a structure has at least the members
343.Va rm_so
344and
345.Va rm_eo ,
346both of type
347.Ft regoff_t
348(a signed arithmetic type at least as large as an
349.Ft off_t
350and a
351.Ft ssize_t ) ,
352containing respectively the offset of the first character of a substring
353and the offset of the first character after the end of the substring.
354Offsets are measured from the beginning of the
355.Fa string
356argument given to
357.Fn regexec .
358An empty substring is denoted by equal offsets,
359both indicating the character following the empty substring.
360.Pp
361The 0th member of the
362.Fa pmatch
363array is filled in to indicate what substring of
364.Fa string
365was matched by the entire RE.
366Remaining members report what substring was matched by parenthesized
367subexpressions within the RE;
368member
369.Va i
370reports subexpression
371.Va i ,
372with subexpressions counted (starting at 1) by the order of their opening
373parentheses in the RE, left to right.
374Unused entries in the array (corresponding either to subexpressions that
375did not participate in the match at all, or to subexpressions that do not
376exist in the RE (that is,
377.Va i
378>
379.Fa preg Ns -> Ns Va re_nsub ) )
380have both
381.Va rm_so
382and
383.Va rm_eo
384set to -1.
385If a subexpression participated in the match several times,
386the reported substring is the last one it matched.
387(Note, as an example in particular, that when the RE
388.Ql "(b*)+"
389matches
390.Ql bbb ,
391the parenthesized subexpression matches each of the three
392.So Li b Sc Ns s
393and then
394an infinite number of empty strings following the last
395.Ql b ,
396so the reported substring is one of the empties.)
397.Pp
398If
399.Dv REG_STARTEND
400is specified,
401.Fa pmatch
402must point to at least one
403.Ft regmatch_t
404(even if
405.Fa nmatch
406is 0 or
407.Dv REG_NOSUB
408was specified),
409to hold the input offsets for
410.Dv REG_STARTEND .
411Use for output is still entirely controlled by
412.Fa nmatch ;
413if
414.Fa nmatch
415is 0 or
416.Dv REG_NOSUB
417was specified,
418the value of
419.Fa pmatch Ns [0]
420will not be changed by a successful
421.Fn regexec .
422.Pp
423The
424.Fn regerror
425function
426maps a non-zero
427.Fa errcode
428from either
429.Fn regcomp
430or
431.Fn regexec
432to a human-readable, printable message.
433If
434.Fa preg
435is
436.No non\- Ns Dv NULL ,
437the error code should have arisen from use of
438the
439.Ft regex_t
440pointed to by
441.Fa preg ,
442and if the error code came from
443.Fn regcomp ,
444it should have been the result from the most recent
445.Fn regcomp
446using that
447.Ft regex_t .
448The
449.Po
450.Fn regerror
451may be able to supply a more detailed message using information
452from the
453.Ft regex_t .
454.Pc
455The
456.Fn regerror
457function
458places the NUL-terminated message into the buffer pointed to by
459.Fa errbuf ,
460limiting the length (including the NUL) to at most
461.Fa errbuf_size
462bytes.
463If the whole message will not fit,
464as much of it as will fit before the terminating NUL is supplied.
465In any case,
466the returned value is the size of buffer needed to hold the whole
467message (including terminating NUL).
468If
469.Fa errbuf_size
470is 0,
471.Fa errbuf
472is ignored but the return value is still correct.
473.Pp
474If the
475.Fa errcode
476given to
477.Fn regerror
478is first ORed with
479.Dv REG_ITOA ,
480the
481.Dq message
482that results is the printable name of the error code,
483e.g.\&
484.Dq Dv REG_NOMATCH ,
485rather than an explanation thereof.
486If
487.Fa errcode
488is
489.Dv REG_ATOI ,
490then
491.Fa preg
492shall be
493.No non\- Ns Dv NULL
494and the
495.Va re_endp
496member of the structure it points to
497must point to the printable name of an error code;
498in this case, the result in
499.Fa errbuf
500is the decimal digits of
501the numeric value of the error code
502(0 if the name is not recognized).
503.Dv REG_ITOA
504and
505.Dv REG_ATOI
506are intended primarily as debugging facilities;
507they are extensions,
508compatible with but not specified by
509.St -p1003.2 ,
510and should be used with
511caution in software intended to be portable to other systems.
512Be warned also that they are considered experimental and changes are possible.
513.Pp
514The
515.Fn regfree
516function
517frees any dynamically-allocated storage associated with the compiled RE
518pointed to by
519.Fa preg .
520The remaining
521.Ft regex_t
522is no longer a valid compiled RE
523and the effect of supplying it to
524.Fn regexec
525or
526.Fn regerror
527is undefined.
528.Pp
529None of these functions references global variables except for tables
530of constants;
531all are safe for use from multiple threads if the arguments are safe.
532.Sh IMPLEMENTATION CHOICES
533There are a number of decisions that
534.St -p1003.2
535leaves up to the implementor,
536either by explicitly saying
537.Dq undefined
538or by virtue of them being
539forbidden by the RE grammar.
540This implementation treats them as follows.
541.Pp
542See
543.Xr re_format 7
544for a discussion of the definition of case-independent matching.
545.Pp
546There is no particular limit on the length of REs,
547except insofar as memory is limited.
548Memory usage is approximately linear in RE size, and largely insensitive
549to RE complexity, except for bounded repetitions.
550See
551.Sx BUGS
552for one short RE using them
553that will run almost any system out of memory.
554.Pp
555A backslashed character other than one specifically given a magic meaning
556by
557.St -p1003.2
558(such magic meanings occur only in obsolete
559.Bq Dq basic
560REs)
561is taken as an ordinary character.
562.Pp
563Any unmatched
564.Ql [\&
565is a
566.Dv REG_EBRACK
567error.
568.Pp
569Equivalence classes cannot begin or end bracket-expression ranges.
570The endpoint of one range cannot begin another.
571.Pp
572.Dv RE_DUP_MAX ,
573the limit on repetition counts in bounded repetitions, is 255.
574.Pp
575A repetition operator
576.Ql ( ?\& ,
577.Ql *\& ,
578.Ql +\& ,
579or bounds)
580cannot follow another
581repetition operator.
582A repetition operator cannot begin an expression or subexpression
583or follow
584.Ql ^\&
585or
586.Ql |\& .
587.Pp
588.Ql |\&
589cannot appear first or last in a (sub)expression or after another
590.Ql |\& ,
591i.e., an operand of
592.Ql |\&
593cannot be an empty subexpression.
594An empty parenthesized subexpression,
595.Ql "()" ,
596is legal and matches an
597empty (sub)string.
598An empty string is not a legal RE.
599.Pp
600A
601.Ql {\&
602followed by a digit is considered the beginning of bounds for a
603bounded repetition, which must then follow the syntax for bounds.
604A
605.Ql {\&
606.Em not
607followed by a digit is considered an ordinary character.
608.Pp
609.Ql ^\&
610and
611.Ql $\&
612beginning and ending subexpressions in obsolete
613.Pq Dq basic
614REs are anchors, not ordinary characters.
615.Sh DIAGNOSTICS
616Non-zero error codes from
617.Fn regcomp
618and
619.Fn regexec
620include the following:
621.Pp
622.Bl -tag -width REG_ECOLLATE -compact
623.It Dv REG_NOMATCH
624The
625.Fn regexec
626function
627failed to match
628.It Dv REG_BADPAT
629invalid regular expression
630.It Dv REG_ECOLLATE
631invalid collating element
632.It Dv REG_ECTYPE
633invalid character class
634.It Dv REG_EESCAPE
635.Ql \e
636applied to unescapable character
637.It Dv REG_ESUBREG
638invalid backreference number
639.It Dv REG_EBRACK
640brackets
641.Ql "[ ]"
642not balanced
643.It Dv REG_EPAREN
644parentheses
645.Ql "( )"
646not balanced
647.It Dv REG_EBRACE
648braces
649.Ql "{ }"
650not balanced
651.It Dv REG_BADBR
652invalid repetition count(s) in
653.Ql "{ }"
654.It Dv REG_ERANGE
655invalid character range in
656.Ql "[ ]"
657.It Dv REG_ESPACE
658ran out of memory
659.It Dv REG_BADRPT
660.Ql ?\& ,
661.Ql *\& ,
662or
663.Ql +\&
664operand invalid
665.It Dv REG_EMPTY
666empty (sub)expression
667.It Dv REG_ASSERT
668cannot happen - you found a bug
669.It Dv REG_INVARG
670invalid argument, e.g.\& negative-length string
671.It Dv REG_ILLSEQ
672illegal byte sequence (bad multibyte character)
673.El
674.Sh SEE ALSO
675.Xr grep 1 ,
676.Xr re_format 7
677.Pp
678.St -p1003.2 ,
679sections 2.8 (Regular Expression Notation)
680and
681B.5 (C Binding for Regular Expression Matching).
682.Sh HISTORY
683Originally written by
684.An Henry Spencer .
685Altered for inclusion in the
686.Bx 4.4
687distribution.
688.Sh BUGS
689This is an alpha release with known defects.
690Please report problems.
691.Pp
692The back-reference code is subtle and doubts linger about its correctness
693in complex cases.
694.Pp
695The
696.Fn regexec
697function
698performance is poor.
699This will improve with later releases.
700The
701.Fa nmatch
702argument
703exceeding 0 is expensive;
704.Fa nmatch
705exceeding 1 is worse.
706The
707.Fn regexec
708function
709is largely insensitive to RE complexity
710.Em except
711that back
712references are massively expensive.
713RE length does matter; in particular, there is a strong speed bonus
714for keeping RE length under about 30 characters,
715with most special characters counting roughly double.
716.Pp
717The
718.Fn regcomp
719function
720implements bounded repetitions by macro expansion,
721which is costly in time and space if counts are large
722or bounded repetitions are nested.
723An RE like, say,
724.Ql "((((a{1,100}){1,100}){1,100}){1,100}){1,100}"
725will (eventually) run almost any existing machine out of swap space.
726.Pp
727There are suspected problems with response to obscure error conditions.
728Notably,
729certain kinds of internal overflow,
730produced only by truly enormous REs or by multiply nested bounded repetitions,
731are probably not handled well.
732.Pp
733Due to a mistake in
734.St -p1003.2 ,
735things like
736.Ql "a)b"
737are legal REs because
738.Ql )\&
739is
740a special character only in the presence of a previous unmatched
741.Ql (\& .
742This cannot be fixed until the spec is fixed.
743.Pp
744The standard's definition of back references is vague.
745For example, does
746.Ql "a\e(\e(b\e)*\e2\e)*d"
747match
748.Ql "abbbd" ?
749Until the standard is clarified,
750behavior in such cases should not be relied on.
751.Pp
752The implementation of word-boundary matching is a bit of a kludge,
753and bugs may lurk in combinations of word-boundary matching and anchoring.
754.Pp
755Word-boundary matching does not work properly in multibyte locales.
756