xref: /freebsd/lib/libc/regex/re_format.7 (revision dc36d6f9)
1.\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
2.\" Copyright (c) 1992, 1993, 1994
3.\"	The Regents of the University of California.  All rights reserved.
4.\"
5.\" This code is derived from software contributed to Berkeley by
6.\" Henry Spencer.
7.\"
8.\" Redistribution and use in source and binary forms, with or without
9.\" modification, are permitted provided that the following conditions
10.\" are met:
11.\" 1. Redistributions of source code must retain the above copyright
12.\"    notice, this list of conditions and the following disclaimer.
13.\" 2. Redistributions in binary form must reproduce the above copyright
14.\"    notice, this list of conditions and the following disclaimer in the
15.\"    documentation and/or other materials provided with the distribution.
16.\" 3. All advertising materials mentioning features or use of this software
17.\"    must display the following acknowledgement:
18.\"	This product includes software developed by the University of
19.\"	California, Berkeley and its contributors.
20.\" 4. Neither the name of the University nor the names of its contributors
21.\"    may be used to endorse or promote products derived from this software
22.\"    without specific prior written permission.
23.\"
24.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
25.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
26.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
27.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
28.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
29.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
30.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
31.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
32.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
33.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
34.\" SUCH DAMAGE.
35.\"
36.Dd June 30, 2014
37.Dt RE_FORMAT 7
38.Os
39.Sh NAME
40.Nm re_format
41.Nd POSIX 1003.2 regular expressions
42.Sh DESCRIPTION
43Regular expressions
44.Pq Dq RE Ns s ,
45as defined in
46.St -p1003.2 ,
47come in two forms:
48modern REs (roughly those of
49.Xr egrep 1 ;
501003.2 calls these
51.Dq extended
52REs)
53and obsolete REs (roughly those of
54.Xr ed 1 ;
551003.2
56.Dq basic
57REs).
58Obsolete REs mostly exist for backward compatibility in some old programs;
59they will be discussed at the end.
60.St -p1003.2
61leaves some aspects of RE syntax and semantics open;
62`\(dd' marks decisions on these aspects that
63may not be fully portable to other
64.St -p1003.2
65implementations.
66.Pp
67A (modern) RE is one\(dd or more non-empty\(dd
68.Em branches ,
69separated by
70.Ql \&| .
71It matches anything that matches one of the branches.
72.Pp
73A branch is one\(dd or more
74.Em pieces ,
75concatenated.
76It matches a match for the first, followed by a match for the second, etc.
77.Pp
78A piece is an
79.Em atom
80possibly followed
81by a single\(dd
82.Ql \&* ,
83.Ql \&+ ,
84.Ql \&? ,
85or
86.Em bound .
87An atom followed by
88.Ql \&*
89matches a sequence of 0 or more matches of the atom.
90An atom followed by
91.Ql \&+
92matches a sequence of 1 or more matches of the atom.
93An atom followed by
94.Ql ?\&
95matches a sequence of 0 or 1 matches of the atom.
96.Pp
97A
98.Em bound
99is
100.Ql \&{
101followed by an unsigned decimal integer,
102possibly followed by
103.Ql \&,
104possibly followed by another unsigned decimal integer,
105always followed by
106.Ql \&} .
107The integers must lie between 0 and
108.Dv RE_DUP_MAX
109(255\(dd) inclusive,
110and if there are two of them, the first may not exceed the second.
111An atom followed by a bound containing one integer
112.Em i
113and no comma matches
114a sequence of exactly
115.Em i
116matches of the atom.
117An atom followed by a bound
118containing one integer
119.Em i
120and a comma matches
121a sequence of
122.Em i
123or more matches of the atom.
124An atom followed by a bound
125containing two integers
126.Em i
127and
128.Em j
129matches
130a sequence of
131.Em i
132through
133.Em j
134(inclusive) matches of the atom.
135.Pp
136An atom is a regular expression enclosed in
137.Ql ()
138(matching a match for the
139regular expression),
140an empty set of
141.Ql ()
142(matching the null string)\(dd,
143a
144.Em bracket expression
145(see below),
146.Ql .\&
147(matching any single character),
148.Ql \&^
149(matching the null string at the beginning of a line),
150.Ql \&$
151(matching the null string at the end of a line), a
152.Ql \e
153followed by one of the characters
154.Ql ^.[$()|*+?{\e
155(matching that character taken as an ordinary character),
156a
157.Ql \e
158followed by any other character\(dd
159(matching that character taken as an ordinary character,
160as if the
161.Ql \e
162had not been present\(dd),
163or a single character with no other significance (matching that character).
164A
165.Ql \&{
166followed by a character other than a digit is an ordinary
167character, not the beginning of a bound\(dd.
168It is illegal to end an RE with
169.Ql \e .
170.Pp
171A
172.Em bracket expression
173is a list of characters enclosed in
174.Ql [] .
175It normally matches any single character from the list (but see below).
176If the list begins with
177.Ql \&^ ,
178it matches any single character
179(but see below)
180.Em not
181from the rest of the list.
182If two characters in the list are separated by
183.Ql \&- ,
184this is shorthand
185for the full
186.Em range
187of characters between those two (inclusive) in the
188collating sequence,
189.No e.g. Ql [0-9]
190in ASCII matches any decimal digit.
191It is illegal\(dd for two ranges to share an
192endpoint,
193.No e.g. Ql a-c-e .
194Ranges are very collating-sequence-dependent,
195and portable programs should avoid relying on them.
196.Pp
197To include a literal
198.Ql \&]
199in the list, make it the first character
200(following a possible
201.Ql \&^ ) .
202To include a literal
203.Ql \&- ,
204make it the first or last character,
205or the second endpoint of a range.
206To use a literal
207.Ql \&-
208as the first endpoint of a range,
209enclose it in
210.Ql [.\&
211and
212.Ql .]\&
213to make it a collating element (see below).
214With the exception of these and some combinations using
215.Ql \&[
216(see next paragraphs), all other special characters, including
217.Ql \e ,
218lose their special significance within a bracket expression.
219.Pp
220Within a bracket expression, a collating element (a character,
221a multi-character sequence that collates as if it were a single character,
222or a collating-sequence name for either)
223enclosed in
224.Ql [.\&
225and
226.Ql .]\&
227stands for the
228sequence of characters of that collating element.
229The sequence is a single element of the bracket expression's list.
230A bracket expression containing a multi-character collating element
231can thus match more than one character,
232e.g.\& if the collating sequence includes a
233.Ql ch
234collating element,
235then the RE
236.Ql [[.ch.]]*c
237matches the first five characters
238of
239.Ql chchcc .
240.Pp
241Within a bracket expression, a collating element enclosed in
242.Ql [=
243and
244.Ql =]
245is an equivalence class, standing for the sequences of characters
246of all collating elements equivalent to that one, including itself.
247(If there are no other equivalent collating elements,
248the treatment is as if the enclosing delimiters were
249.Ql [.\&
250and
251.Ql .] . )
252For example, if
253.Ql x
254and
255.Ql y
256are the members of an equivalence class,
257then
258.Ql [[=x=]] ,
259.Ql [[=y=]] ,
260and
261.Ql [xy]
262are all synonymous.
263An equivalence class may not\(dd be an endpoint
264of a range.
265.Pp
266Within a bracket expression, the name of a
267.Em character class
268enclosed in
269.Ql [:
270and
271.Ql :]
272stands for the list of all characters belonging to that
273class.
274Standard character class names are:
275.Bl -column "alnum" "digit" "xdigit" -offset indent
276.It Em "alnum	digit	punct"
277.It Em "alpha	graph	space"
278.It Em "blank	lower	upper"
279.It Em "cntrl	print	xdigit"
280.El
281.Pp
282These stand for the character classes defined in
283.Xr ctype 3 .
284A locale may provide others.
285A character class may not be used as an endpoint of a range.
286.Pp
287A bracketed expression like
288.Ql [[:class:]]
289can be used to match a single character that belongs to a character
290class.
291The reverse, matching any character that does not belong to a specific
292class, the negation operator of bracket expressions may be used:
293.Ql [^[:class:]] .
294.Pp
295There are two special cases\(dd of bracket expressions:
296the bracket expressions
297.Ql [[:<:]]
298and
299.Ql [[:>:]]
300match the null string at the beginning and end of a word respectively.
301A word is defined as a sequence of word characters
302which is neither preceded nor followed by
303word characters.
304A word character is an
305.Em alnum
306character (as defined by
307.Xr ctype 3 )
308or an underscore.
309This is an extension,
310compatible with but not specified by
311.St -p1003.2 ,
312and should be used with
313caution in software intended to be portable to other systems.
314The additional word delimiters
315.Ql \e<
316and
317.Ql \e>
318are provided to ease compatibility with traditional
319SVR4
320systems but are not portable and should be avoided.
321.Pp
322In the event that an RE could match more than one substring of a given
323string,
324the RE matches the one starting earliest in the string.
325If the RE could match more than one substring starting at that point,
326it matches the longest.
327Subexpressions also match the longest possible substrings, subject to
328the constraint that the whole match be as long as possible,
329with subexpressions starting earlier in the RE taking priority over
330ones starting later.
331Note that higher-level subexpressions thus take priority over
332their lower-level component subexpressions.
333.Pp
334Match lengths are measured in characters, not collating elements.
335A null string is considered longer than no match at all.
336For example,
337.Ql bb*
338matches the three middle characters of
339.Ql abbbc ,
340.Ql (wee|week)(knights|nights)
341matches all ten characters of
342.Ql weeknights ,
343when
344.Ql (.*).*\&
345is matched against
346.Ql abc
347the parenthesized subexpression
348matches all three characters, and
349when
350.Ql (a*)*
351is matched against
352.Ql bc
353both the whole RE and the parenthesized
354subexpression match the null string.
355.Pp
356If case-independent matching is specified,
357the effect is much as if all case distinctions had vanished from the
358alphabet.
359When an alphabetic that exists in multiple cases appears as an
360ordinary character outside a bracket expression, it is effectively
361transformed into a bracket expression containing both cases,
362.No e.g. Ql x
363becomes
364.Ql [xX] .
365When it appears inside a bracket expression, all case counterparts
366of it are added to the bracket expression, so that (e.g.)
367.Ql [x]
368becomes
369.Ql [xX]
370and
371.Ql [^x]
372becomes
373.Ql [^xX] .
374.Pp
375No particular limit is imposed on the length of REs\(dd.
376Programs intended to be portable should not employ REs longer
377than 256 bytes,
378as an implementation can refuse to accept such REs and remain
379POSIX-compliant.
380.Pp
381Obsolete
382.Pq Dq basic
383regular expressions differ in several respects.
384.Ql \&|
385is an ordinary character and there is no equivalent
386for its functionality.
387.Ql \&+
388and
389.Ql ?\&
390are ordinary characters, and their functionality
391can be expressed using bounds
392.Po
393.Ql {1,}
394or
395.Ql {0,1}
396respectively
397.Pc .
398Also note that
399.Ql x+
400in modern REs is equivalent to
401.Ql xx* .
402The delimiters for bounds are
403.Ql \e{
404and
405.Ql \e} ,
406with
407.Ql \&{
408and
409.Ql \&}
410by themselves ordinary characters.
411The parentheses for nested subexpressions are
412.Ql \e(
413and
414.Ql \e) ,
415with
416.Ql \&(
417and
418.Ql \&)
419by themselves ordinary characters.
420.Ql \&^
421is an ordinary character except at the beginning of the
422RE or\(dd the beginning of a parenthesized subexpression,
423.Ql \&$
424is an ordinary character except at the end of the
425RE or\(dd the end of a parenthesized subexpression,
426and
427.Ql \&*
428is an ordinary character if it appears at the beginning of the
429RE or the beginning of a parenthesized subexpression
430(after a possible leading
431.Ql \&^ ) .
432Finally, there is one new type of atom, a
433.Em back reference :
434.Ql \e
435followed by a non-zero decimal digit
436.Em d
437matches the same sequence of characters
438matched by the
439.Em d Ns th
440parenthesized subexpression
441(numbering subexpressions by the positions of their opening parentheses,
442left to right),
443so that (e.g.)
444.Ql \e([bc]\e)\e1
445matches
446.Ql bb
447or
448.Ql cc
449but not
450.Ql bc .
451.Sh SEE ALSO
452.Xr regex 3
453.Rs
454.%T Regular Expression Notation
455.%R IEEE Std
456.%N 1003.2
457.%P section 2.8
458.Re
459.Sh BUGS
460Having two kinds of REs is a botch.
461.Pp
462The current
463.St -p1003.2
464spec says that
465.Ql \&)
466is an ordinary character in
467the absence of an unmatched
468.Ql \&( ;
469this was an unintentional result of a wording error,
470and change is likely.
471Avoid relying on it.
472.Pp
473Back references are a dreadful botch,
474posing major problems for efficient implementations.
475They are also somewhat vaguely defined
476(does
477.Ql a\e(\e(b\e)*\e2\e)*d
478match
479.Ql abbbd ? ) .
480Avoid using them.
481.Pp
482.St -p1003.2
483specification of case-independent matching is vague.
484The
485.Dq one case implies all cases
486definition given above
487is current consensus among implementors as to the right interpretation.
488.Pp
489The syntax for word boundaries is incredibly ugly.
490