.\" xref: /dragonfly/lib/libc/tre-regex/re_format.7 (revision e4adeac1)
.\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
.\" Copyright (c) 1992, 1993, 1994
.\"	The Regents of the University of California.  All rights reserved.
.\"
.\" This code is derived from software contributed to Berkeley by
.\" Henry Spencer.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\"	@(#)re_format.7	8.3 (Berkeley) 3/20/94
.\" $FreeBSD: src/lib/libc/regex/re_format.7,v 1.12 2008/09/05 17:41:20 keramida Exp $
.\"
.Dd August 6, 2015
.Dt RE_FORMAT 7
.Os
.Sh NAME
.Nm re_format
.Nd POSIX 1003.2 regular expressions
.Sh DESCRIPTION
Regular expressions
.Pq Dq RE Ns s ,
as defined in
.St -p1003.2 ,
come in two forms:
modern REs (roughly those of
.Xr egrep 1 ;
1003.2 calls these
.Dq extended
REs)
and obsolete REs (roughly those of
.Xr ed 1 ;
1003.2
.Dq basic
REs).
Obsolete REs mostly exist for backward compatibility in some old programs;
they will be discussed at the end.
.St -p1003.2
leaves some aspects of RE syntax and semantics open;
`\(dd' marks decisions on these aspects that
may not be fully portable to other
.St -p1003.2
implementations.
.Pp
A (modern) RE is one\(dd or more non-empty\(dd
.Em branches ,
separated by
.Ql \&| .
It matches anything that matches one of the branches.
.Pp
A branch is one\(dd or more
.Em pieces ,
concatenated.
It matches a match for the first, followed by a match for the second, etc.
.Pp
A piece is an
.Em atom
possibly followed
by a single\(dd
.Ql \&* ,
.Ql \&+ ,
.Ql \&? ,
or
.Em bound .
An atom followed by
.Ql \&*
matches a sequence of 0 or more matches of the atom.
An atom followed by
.Ql \&+
matches a sequence of 1 or more matches of the atom.
An atom followed by
.Ql ?\&
matches a sequence of 0 or 1 matches of the atom.
.Pp
A
.Em bound
is
.Ql \&{
followed by an unsigned decimal integer,
possibly followed by
.Ql \&,
possibly followed by another unsigned decimal integer,
always followed by
.Ql \&} .
The integers must lie between 0 and
.Dv RE_DUP_MAX
(255\(dd) inclusive,
and if there are two of them, the first may not exceed the second.
An atom followed by a bound containing one integer
.Em i
and no comma matches
a sequence of exactly
.Em i
matches of the atom.
An atom followed by a bound
containing one integer
.Em i
and a comma matches
a sequence of
.Em i
or more matches of the atom.
An atom followed by a bound
containing two integers
.Em i
and
.Em j
matches
a sequence of
.Em i
through
.Em j
(inclusive) matches of the atom.
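.Pp
As a minimal sketch of the repetition rules above, the following C fragment
wraps
.Xr regcomp 3
and
.Xr regexec 3
in a hypothetical helper,
.Fn re_matches ,
that reports whether a modern RE matches a string.
The helper name and its return convention are assumptions made for
illustration; only the
.Xr regex 3
calls are real.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of any library): returns 1 if the modern
 * (extended) RE matches somewhere in text, 0 if it does not, and -1 if
 * the RE fails to compile.
 */
static int
re_matches(const char *pattern, const char *text)
{
	regex_t re;
	int rc;

	assert(pattern != NULL && text != NULL);
	/* REG_NOSUB: only a yes/no answer is needed, no match offsets. */
	if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
		return (-1);
	rc = regexec(&re, text, 0, NULL, 0);
	regfree(&re);
	return (rc == 0);
}
```

.Pp
With the pattern anchored by
.Ql \&^
and
.Ql \&$ ,
a bound of
.Ql {2}
accepts exactly two repetitions,
.Ql {2,}
accepts two or more, and
.Ql {1,3}
accepts one through three.
A bound whose first integer exceeds the second is expected to be rejected
when the RE is compiled.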
.Pp
An atom is a regular expression enclosed in
.Ql ()
(matching a match for the
regular expression),
an empty set of
.Ql ()
(matching the null string)\(dd,
a
.Em bracket expression
(see below),
.Ql .\&
(matching any single character),
.Ql \&^
(matching the null string at the beginning of a line),
.Ql \&$
(matching the null string at the end of a line), a
.Ql \e
followed by one of the characters
.Ql ^.[$()|*+?{\e
(matching that character taken as an ordinary character),
a
.Ql \e
followed by any other character\(dd
(matching that character taken as an ordinary character,
as if the
.Ql \e
had not been present\(dd),
or a single character with no other significance (matching that character).
A
.Ql \&{
followed by a character other than a digit is an ordinary
character, not the beginning of a bound\(dd.
It is illegal to end an RE with
.Ql \e .
.Pp
A
.Em bracket expression
is a list of characters enclosed in
.Ql [] .
It normally matches any single character from the list (but see below).
If the list begins with
.Ql \&^ ,
it matches any single character
(but see below)
.Em not
from the rest of the list.
If two characters in the list are separated by
.Ql \&- ,
this is shorthand
for the full
.Em range
of characters between those two (inclusive) in the
collating sequence,
.No e.g. Ql [0-9]
in ASCII matches any decimal digit.
It is illegal\(dd for two ranges to share an
endpoint,
.No e.g. Ql a-c-e .
Ranges are very collating-sequence-dependent,
and portable programs should avoid relying on them.
.Pp
To include a literal
.Ql \&]
in the list, make it the first character
(following a possible
.Ql \&^ ) .
To include a literal
.Ql \&- ,
make it the first or last character,
or the second endpoint of a range.
To use a literal
.Ql \&-
as the first endpoint of a range,
enclose it in
.Ql [.\&
and
.Ql .]\&
to make it a collating element (see below).
With the exception of these and some combinations using
.Ql \&[
(see next paragraphs), all other special characters, including
.Ql \e ,
lose their special significance within a bracket expression.
.Pp
Within a bracket expression, a collating element (a character,
a multi-character sequence that collates as if it was a single character,
or a collating-sequence name for either)
enclosed in
.Ql [.\&
and
.Ql .]\&
stands for the
sequence of characters of that collating element.
The sequence is a single element of the bracket expression's list.
A bracket expression containing a multi-character collating element
can thus match more than one character,
e.g.\& if the collating sequence includes a
.Ql ch
collating element,
then the RE
.Ql [[.ch.]]*c
matches the first five characters
of
.Ql chchcc .
.Pp
Within a bracket expression, a collating element enclosed in
.Ql [=
and
.Ql =]
is an equivalence class, standing for the sequences of characters
of all collating elements equivalent to that one, including itself.
(If there are no other equivalent collating elements,
the treatment is as if the enclosing delimiters were
.Ql [.\&
and
.Ql .] . )
For example, if
.Ql x
and
.Ql y
are the members of an equivalence class,
then
.Ql [[=x=]] ,
.Ql [[=y=]] ,
and
.Ql [xy]
are all synonymous.
An equivalence class may not\(dd be an endpoint
of a range.
.Pp
Within a bracket expression, the name of a
.Em character class
enclosed in
.Ql [:
and
.Ql :]
stands for the list of all characters belonging to that
class.
Standard character class names are:
.Bl -column "alnum" "digit" "xdigit" -offset indent
.It Em "alnum	digit	punct"
.It Em "alpha	graph	space"
.It Em "blank	lower	upper"
.It Em "cntrl	print	xdigit"
.El
.Pp
These stand for the character classes defined in
.Xr ctype 3 .
A locale may provide others.
A character class may not be used as an endpoint of a range.
.Pp
A bracketed expression like
.Ql [[:class:]]
can be used to match a single character that belongs to a character
class.
For the reverse, matching any character that does not belong to a specific
class, the negation operator of bracket expressions may be used:
.Ql [^[:class:]] .
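.Pp
To illustrate character classes and their negation, the following C sketch
reports where a modern RE first matches in a string.
The helper
.Fn first_match_offset
is hypothetical and its return convention is an assumption made for this
example; the results described afterwards assume the default
.Dq C
locale.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of any library): returns the byte offset
 * at which the modern RE first matches in text, -1 if there is no match,
 * or -2 if the RE fails to compile.
 */
static long
first_match_offset(const char *pattern, const char *text)
{
	regex_t re;
	regmatch_t m;
	long off;

	assert(pattern != NULL && text != NULL);
	if (regcomp(&re, pattern, REG_EXTENDED) != 0)
		return (-2);
	/* Slot 0 of the match array describes the whole match. */
	off = (regexec(&re, text, 1, &m, 0) == 0) ? (long)m.rm_so : -1;
	regfree(&re);
	return (off);
}
```

.Pp
Against the string
.Ql abc42 ,
both
.Ql [[:digit:]]
and
.Ql [^[:alpha:]]
first match at offset 3, while
.Ql [[:upper:]]
does not match at all.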
.Pp
There are two special cases\(dd of bracket expressions:
the bracket expressions
.Ql [[:<:]]
and
.Ql [[:>:]]
match the null string at the beginning and end of a word respectively.
A word is defined as a sequence of word characters
which is neither preceded nor followed by
word characters.
A word character is an
.Em alnum
character (as defined by
.Xr ctype 3 )
or an underscore.
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
.Pp
In the event that an RE could match more than one substring of a given
string,
the RE matches the one starting earliest in the string.
If the RE could match more than one substring starting at that point,
it matches the longest.
Subexpressions also match the longest possible substrings, subject to
the constraint that the whole match be as long as possible,
with subexpressions starting earlier in the RE taking priority over
ones starting later.
Note that higher-level subexpressions thus take priority over
their lower-level component subexpressions.
.Pp
Match lengths are measured in characters, not collating elements.
A null string is considered longer than no match at all.
For example,
.Ql bb*
matches the three middle characters of
.Ql abbbc ,
.Ql (wee|week)(knights|nights)
matches all ten characters of
.Ql weeknights ,
when
.Ql (.*).*\&
is matched against
.Ql abc
the parenthesized subexpression
matches all three characters, and
when
.Ql (a*)*
is matched against
.Ql bc
both the whole RE and the parenthesized
subexpression match the null string.
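.Pp
The examples above can be checked with a short C sketch that reads the
match offsets filled in by
.Xr regexec 3 .
The helper
.Fn match_span
and the constant
.Dv MAX_SUB
are hypothetical names introduced for this illustration.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

enum { MAX_SUB = 10 };	/* assumed upper limit on reported subexpressions */

/*
 * Hypothetical helper (not part of any library): matches the modern RE
 * against text and stores the start and end offsets of subexpression idx
 * (0 is the whole match) in *so and *eo.  Returns 0 on a match, 1 on no
 * match, and -1 on a compile error or an idx of MAX_SUB or more.
 */
static int
match_span(const char *pattern, const char *text, size_t idx,
    long *so, long *eo)
{
	regex_t re;
	regmatch_t m[MAX_SUB];
	int rc;

	assert(pattern != NULL && text != NULL && so != NULL && eo != NULL);
	if (idx >= MAX_SUB || regcomp(&re, pattern, REG_EXTENDED) != 0)
		return (-1);
	rc = regexec(&re, text, MAX_SUB, m, 0);
	if (rc == 0) {
		/* Offsets are -1 for a subexpression that did not take part. */
		*so = (long)m[idx].rm_so;
		*eo = (long)m[idx].rm_eo;
	}
	regfree(&re);
	return (rc == 0 ? 0 : 1);
}
```

.Pp
For
.Ql bb*
against
.Ql abbbc ,
the whole match should span offsets 1 through 4 (the end offset is one past
the last matched character).
How offsets are reported for subexpressions inside repetitions, such as
.Ql (a*)* ,
has been known to vary between implementations, so portable code should
not depend on them.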
.Pp
If case-independent matching is specified,
the effect is much as if all case distinctions had vanished from the
alphabet.
When an alphabetic that exists in multiple cases appears as an
ordinary character outside a bracket expression, it is effectively
transformed into a bracket expression containing both cases,
.No e.g. Ql x
becomes
.Ql [xX] .
When it appears inside a bracket expression, all case counterparts
of it are added to the bracket expression, so that (e.g.)
.Ql [x]
becomes
.Ql [xX]
and
.Ql [^x]
becomes
.Ql [^xX] .
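.Pp
A minimal C sketch of the case-independence rule follows.
The helper
.Fn re_matches_flags
is a hypothetical name; it passes caller-supplied compile flags such as
.Dv REG_ICASE
to
.Xr regcomp 3 .

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of any library): compiles pattern with
 * the given regcomp() flags plus REG_NOSUB and returns 1 if it matches
 * text, 0 if it does not, and -1 if the RE fails to compile.
 */
static int
re_matches_flags(const char *pattern, const char *text, int cflags)
{
	regex_t re;
	int rc;

	assert(pattern != NULL && text != NULL);
	if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
		return (-1);
	rc = regexec(&re, text, 0, NULL, 0);
	regfree(&re);
	return (rc == 0);
}
```

.Pp
With
.Dv REG_EXTENDED | REG_ICASE ,
the RE
.Ql ^[^x]$
should reject both
.Ql x
and
.Ql X ,
since the negated list effectively becomes
.Ql [^xX] .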
.Pp
No particular limit is imposed on the length of REs\(dd.
Programs intended to be portable should not employ REs longer
than 256 bytes,
as an implementation can refuse to accept such REs and remain
POSIX-compliant.
.Pp
Obsolete
.Pq Dq basic
regular expressions differ in several respects.
.Ql \&|
is an ordinary character and there is no equivalent
for its functionality.
.Ql \&+
and
.Ql ?\&
are ordinary characters, and their functionality
can be expressed using bounds
.No ( Ql {1,}
or
.Ql {0,1}
respectively).
Also note that
.Ql x+
in modern REs is equivalent to
.Ql xx* .
The delimiters for bounds are
.Ql \e{
and
.Ql \e} ,
with
.Ql \&{
and
.Ql \&}
by themselves ordinary characters.
The parentheses for nested subexpressions are
.Ql \e(
and
.Ql \e) ,
with
.Ql \&(
and
.Ql \&)
by themselves ordinary characters.
.Ql \&^
is an ordinary character except at the beginning of the
RE or\(dd the beginning of a parenthesized subexpression,
.Ql \&$
is an ordinary character except at the end of the
RE or\(dd the end of a parenthesized subexpression,
and
.Ql \&*
is an ordinary character if it appears at the beginning of the
RE or the beginning of a parenthesized subexpression
(after a possible leading
.Ql \&^ ) .
Finally, there is one new type of atom, a
.Em back reference :
.Ql \e
followed by a non-zero decimal digit
.Em d
matches the same sequence of characters
matched by the
.Em d Ns th
parenthesized subexpression
(numbering subexpressions by the positions of their opening parentheses,
left to right),
so that (e.g.)
.Ql \e([bc]\e)\e1
matches
.Ql bb
or
.Ql cc
but not
.Ql bc .
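.Pp
The differences described above can be exercised with a C sketch that
compiles obsolete REs, that is, without
.Dv REG_EXTENDED .
The helper
.Fn bre_matches
is a hypothetical name used only for this illustration.
Note that each backslash of the RE must be doubled inside a C string
literal.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of any library): compiles pattern as an
 * obsolete (basic) RE and returns 1 if it matches text, 0 if it does
 * not, and -1 if the RE fails to compile.
 */
static int
bre_matches(const char *pattern, const char *text)
{
	regex_t re;
	int rc;

	assert(pattern != NULL && text != NULL);
	/*
	 * No REG_EXTENDED: basic syntax.  REG_NOSUB only suppresses offset
	 * reporting; back references inside the RE still work.
	 */
	if (regcomp(&re, pattern, REG_NOSUB) != 0)
		return (-1);
	rc = regexec(&re, text, 0, NULL, 0);
	regfree(&re);
	return (rc == 0);
}
```

.Pp
With this helper, the back reference example from the text, written in C
as
.Li \&"^\e\e([bc]\e\e)\e\e1$" ,
should match
.Ql bb
and
.Ql cc
but not
.Ql bc ,
while an unescaped
.Ql \&+
or
.Ql \&|
in the pattern matches only the literal character.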
.Sh ENHANCED FEATURES
When the
.Dv REG_ENHANCED
flag is passed to one of the
.Fn regcomp
variants, additional features are activated.
Like the enhanced
.Nm regex
implementations in scripting languages such as
.Xr perl 1
and
.Xr python 1 ,
these additional features may conflict with the
.St -p1003.2
standards in some ways.
Use this with care in situations which require portability
(including to past versions of Mac OS X using the previous
.Nm regex
implementation).
.Pp
For enhanced basic REs,
.Ql \&+ ,
.Ql \&?
and
.Ql \&|
remain regular characters, but
.Ql \e+ ,
.Ql \e?
and
.Ql \e|
have the same special meaning as the unescaped characters do for
extended REs, i.e., one or more matches, zero or one matches and alternation,
respectively.
For enhanced extended REs,
back references are available.
Additional enhanced features are listed below.
.Pp
Within a bracket expression, most characters lose their magic.
This also applies to the additional enhanced features, which don't operate
inside a bracket expression.
.Ss Assertions (available for both enhanced basic and enhanced extended REs)
In addition to
.Ql \&^
and
.Ql \&$
(the assertions that match the null string at the beginning and end of line,
respectively), the following assertions become available:
.Bl -tag -width ".Sy \eB" -offset indent
.It Sy \e<
Matches the null string at the beginning of a word.
This is equivalent to
.Ql [[:<:]] .
.It Sy \e>
Matches the null string at the end of a word.
This is equivalent to
.Ql [[:>:]] .
.It Sy \eb
Matches the null string at a word boundary (either the beginning or end of
a word).
.It Sy \eB
Matches the null string where there is no word boundary.
This is the opposite of
.Ql \eb .
.El
.Ss Shortcuts (available for both enhanced basic and enhanced extended REs)
The following shortcuts can be used to replace more complicated
bracket expressions.
.Bl -tag -width ".Sy \eD" -offset indent
.It Sy \ed
Matches a digit character.
This is equivalent to
.Ql [[:digit:]] .
.It Sy \eD
Matches a non-digit character.
This is equivalent to
.Ql [^[:digit:]] .
.It Sy \es
Matches a space character.
This is equivalent to
.Ql [[:space:]] .
.It Sy \eS
Matches a non-space character.
This is equivalent to
.Ql [^[:space:]] .
.It Sy \ew
Matches a word character.
This is equivalent to
.Ql [[:alnum:]_] .
.It Sy \eW
Matches a non-word character.
This is equivalent to
.Ql [^[:alnum:]_] .
.El
.Ss Literal Sequences (available for both enhanced basic and enhanced extended REs)
Literals are normally just ordinary characters that are matched directly.
Under enhanced mode, certain character sequences are
converted to specific literals.
.Bl -tag -width ".Sy \ea" -offset indent
.It Sy \ea
The
.Dq bell
character (ASCII code 7).
.It Sy \ee
The
.Dq escape
character (ASCII code 27).
.It Sy \ef
The
.Dq form-feed
character (ASCII code 12).
.It Sy \en
The
.Dq new-line/line-feed
character (ASCII code 10).
.It Sy \er
The
.Dq carriage-return
character (ASCII code 13).
.It Sy \et
The
.Dq horizontal-tab
character (ASCII code 9).
.El
.Pp
Literals can also be specified directly, using their wide character values.
Note that when matching a multibyte character string, the string's bytes
are converted to wide characters before comparing.
This means that a single literal wide character value may match more than
one string byte, depending on the locale's wide character encoding.
.Bl -tag -width ".Sy \ex{ Ns Em x.. Ns Sy \&}" -offset indent
.It Sy \ex Ns Em x..
An arbitrary eight-bit value.
The
.Em x..
sequence represents zero, one or two hexadecimal digits.
(Note: if
.Em x..
is less than two hexadecimal digits, and the character following this sequence
happens to be a hexadecimal digit, use the (following) brace form to avoid
confusion.)
.It Sy \ex{ Ns Em x.. Ns Sy \&}
An arbitrary, up to 32-bit value.
The
.Em x..
sequence is an arbitrary sequence of hexadecimal digits that is long enough
to represent the necessary value.
.El
.Ss Inline Literal Mode (available for both enhanced basic and enhanced extended REs)
A
.Ql \eQ
sequence causes literal
.Pq Dq quote
mode to be entered,
while
.Ql \eE
ends literal mode, and returns to normal regular expression processing.
This is similar to specifying the
.Dv REG_NOSPEC
(or
.Dv REG_LITERAL )
option to
.Fn regcomp ,
except that rather than applying to the whole RE string, it only applies to
the part between the
.Ql \eQ
and
.Ql \eE .
Note that it is not possible to have a
.Ql \eE
in the middle of an inline literal range, as that would terminate literal mode
prematurely.
.Ss Minimal Repetitions (available for enhanced extended REs only)
By default, the repetition operators,
.Ql \&* ,
.Em bound ,
.Ql \&?
and
.Ql \&+
are
.Em greedy ;
they try to match as many times as possible.
In enhanced mode, appending a
.Ql \&?
to a repetition operator makes it minimal (or
.Em ungreedy ) ;
it tries to match the fewest number of times (including zero times, as
appropriate).
.Pp
For example, against the string
.Ql aaa ,
the RE
.Ql a*
would match the entire string,
while
.Ql a*?
would match the null string at the beginning of the line
(matches zero times).
Likewise, against the string
.Ql ababab ,
the RE
.Ql .*b
would also match the entire string,
while
.Ql .*?b
would only match the first two characters.
.Pp
The
.Fn regcomp
flag
.Dv REG_UNGREEDY
will make the regular
.Pq greedy
repetition operators ungreedy by default.
Appending
.Ql \&?
makes them greedy again.
.Pp
Note that minimal repetitions are not specified by an official
standard, so there may be differences between different implementations.
In the current implementation, minimal repetitions have a high precedence,
and can cause other standards requirements to be violated.
For instance, on the string
.Ql aaaaa ,
the RE
.Ql (aaa??)*
will only match the first four characters, violating the rules that the longest
possible match is made and the longest subexpressions are matched.
Using
.Ql (aaa??)*$
forces the entire string to be matched.
.Ss Non-capturing Parenthesized Subexpressions (available for enhanced extended REs only)
Normally, the match offsets to parenthesized subexpressions are
recorded in the
.Fa pmatch
array (that is, when
.Dv REG_NOSUB
is not specified, and
.Fa nmatch
is large enough to encompass the parenthesized subexpression in question).
In enhanced mode, if the first two characters following the left parenthesis
are
.Ql ?: ,
grouping of the remaining contents is done, but the corresponding offsets are
not recorded in the
.Fa pmatch
array.
For example, against the string
.Ql fubar ,
the RE
.Ql (fu)(bar)
would have two subexpression matches in
.Fa pmatch ;
the first for
.Ql fu
and the second for
.Ql bar .
But with the RE
.Ql (?:fu)(bar) ,
there would only be one subexpression match, that of
.Ql bar .
Furthermore,
against the string
.Ql fufubar ,
the RE
.Ql (?:fu)*(bar)
would again match the entire string, but only
.Ql bar
would be recorded in
.Fa pmatch .
.Ss Inline Options (available for enhanced extended REs only)
Like the inline literal mode mentioned above, other options can be switched
on and off for part of an RE.
.Ql (? Ns Em o.. Ns \&)
will turn on the options specified in
.Em o..
(one or more options characters; see below), while
.Ql (?- Ns Em o.. Ns \&)
will turn off the specified options, and
.Ql (? Ns Em o1.. Ns \&- Ns Em o2.. Ns \&)
will turn on the first set of options, and turn off the second set.
.Pp
The available options are:
.Bl -tag -width ".Sy \&U" -offset indent
.It Sy \&i
Turning on this option will ignore case during matching, while turning off
will restore case-sensitive matching.
If
.Dv REG_ICASE
was specified to
.Fn regcomp ,
this option can be used to turn that off.
.It Sy \&n
Turn on or off special handling of the newline character.
If
.Dv REG_NEWLINE
was specified to
.Fn regcomp ,
this option can be used to turn that off.
.It Sy \&U
Turning on this option will make ungreedy repetitions the default, while
turning off will make greedy repetitions the default.
If
.Dv REG_UNGREEDY
was specified to
.Fn regcomp ,
this option can be used to turn that off.
.El
.Pp
The scope of the option change begins immediately following the right
parenthesis,
and extends up to the end of the enclosing subexpression (if any).
Thus, for example, given the RE
.Ql (fu(?i)bar)baz ,
the
.Ql fu
portion matches case sensitively,
.Ql bar
matches case insensitively, and
.Ql baz
matches case sensitively again (since it is outside the scope of the
subexpression in which the inline option was specified).
.Pp
The inline options syntax can be combined with the non-capturing parenthesized
subexpression to limit the option scope to just that of the subexpression.
Then, for example,
.Ql fu(?i:bar)baz
is similar to the previous example, except for the parenthesized subexpression
around
.Ql fu(?i)bar
in the previous example.
.Ss Inline Comments (available for enhanced extended REs only)
The syntax
.Ql (?# Ns Em comment Ns \&)
can be used to embed comments within an RE.
Note that
.Em comment
cannot contain a right parenthesis.
Also note that while syntactically, option characters can be added before
the
.Ql \&#
character, they will be ignored.
.Sh SEE ALSO
.Xr regex 3
.Rs
.%T Regular Expression Notation
.%R IEEE Std
.%N 1003.2
.%P section 2.8
.Re
.Sh BUGS
Having two kinds of REs is a botch.
.Pp
The current
.St -p1003.2
spec says that
.Ql \&)
is an ordinary character in
the absence of an unmatched
.Ql \&( ;
this was an unintentional result of a wording error,
and change is likely.
Avoid relying on it.
.Pp
Back references are a dreadful botch,
posing major problems for efficient implementations.
They are also somewhat vaguely defined
(does
.Ql a\e(\e(b\e)*\e2\e)*d
match
.Ql abbbd ? ) .
Avoid using them.
.Pp
The
.St -p1003.2
specification of case-independent matching is vague.
The
.Dq one case implies all cases
definition given above
is the current consensus among implementors as to the right interpretation.
.Pp
The bracket syntax for word boundaries is incredibly ugly.
821