.\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
.\" Copyright (c) 1992, 1993, 1994
.\"	The Regents of the University of California.  All rights reserved.
.\"
.\" This code is derived from software contributed to Berkeley by
.\" Henry Spencer.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\"	@(#)re_format.7	8.3 (Berkeley) 3/20/94
.\" $FreeBSD: src/lib/libc/regex/re_format.7,v 1.12 2008/09/05 17:41:20 keramida Exp $
.\"
.Dd August 6, 2015
.Dt RE_FORMAT 7
.Os
.Sh NAME
.Nm re_format
.Nd POSIX 1003.2 regular expressions
.Sh DESCRIPTION
Regular expressions
.Pq Dq RE Ns s ,
as defined in
.St -p1003.2 ,
come in two forms:
modern REs (roughly those of
.Xr egrep 1 ;
1003.2 calls these
.Dq extended
REs)
and obsolete REs (roughly those of
.Xr ed 1 ;
1003.2
.Dq basic
REs).
Obsolete REs mostly exist for backward compatibility in some old programs;
they will be discussed at the end.
.St -p1003.2
leaves some aspects of RE syntax and semantics open;
`\(dd' marks decisions on these aspects that
may not be fully portable to other
.St -p1003.2
implementations.
.Pp
A (modern) RE is one\(dd or more non-empty\(dd
.Em branches ,
separated by
.Ql \&| .
It matches anything that matches one of the branches.
.Pp
A branch is one\(dd or more
.Em pieces ,
concatenated.
It matches a match for the first, followed by a match for the second, etc.
.Pp
A piece is an
.Em atom
possibly followed
by a single\(dd
.Ql \&* ,
.Ql \&+ ,
.Ql \&? ,
or
.Em bound .
An atom followed by
.Ql \&*
matches a sequence of 0 or more matches of the atom.
An atom followed by
.Ql \&+
matches a sequence of 1 or more matches of the atom.
An atom followed by
.Ql ?\&
matches a sequence of 0 or 1 matches of the atom.
.Pp
A
.Em bound
is
.Ql \&{
followed by an unsigned decimal integer,
possibly followed by
.Ql \&,
possibly followed by another unsigned decimal integer,
always followed by
.Ql \&} .
The integers must lie between 0 and
.Dv RE_DUP_MAX
(255\(dd) inclusive,
and if there are two of them, the first may not exceed the second.
An atom followed by a bound containing one integer
.Em i
and no comma matches
a sequence of exactly
.Em i
matches of the atom.
An atom followed by a bound
containing one integer
.Em i
and a comma matches
a sequence of
.Em i
or more matches of the atom.
An atom followed by a bound
containing two integers
.Em i
and
.Em j
matches
a sequence of
.Em i
through
.Em j
(inclusive) matches of the atom.
.Pp
An atom is a regular expression enclosed in
.Ql ()
(matching a match for the
regular expression),
an empty set of
.Ql ()
(matching the null string)\(dd,
a
.Em bracket expression
(see below),
.Ql .\&
(matching any single character),
.Ql \&^
(matching the null string at the beginning of a line),
.Ql \&$
(matching the null string at the end of a line), a
.Ql \e
followed by one of the characters
.Ql ^.[$()|*+?{\e
(matching that character taken as an ordinary character),
a
.Ql \e
followed by any other character\(dd
(matching that character taken as an ordinary character,
as if the
.Ql \e
had not been present\(dd),
or a single character with no other significance (matching that character).
A
.Ql \&{
followed by a character other than a digit is an ordinary
character, not the beginning of a bound\(dd.
It is illegal to end an RE with
.Ql \e .
.Pp
A
.Em bracket expression
is a list of characters enclosed in
.Ql [] .
It normally matches any single character from the list (but see below).
If the list begins with
.Ql \&^ ,
it matches any single character
(but see below)
.Em not
from the rest of the list.
If two characters in the list are separated by
.Ql \&- ,
this is shorthand
for the full
.Em range
of characters between those two (inclusive) in the
collating sequence,
.No e.g. Ql [0-9]
in ASCII matches any decimal digit.
It is illegal\(dd for two ranges to share an
endpoint,
.No e.g. Ql a-c-e .
Ranges are very collating-sequence-dependent,
and portable programs should avoid relying on them.
.Pp
To include a literal
.Ql \&]
in the list, make it the first character
(following a possible
.Ql \&^ ) .
To include a literal
.Ql \&- ,
make it the first or last character,
or the second endpoint of a range.
To use a literal
.Ql \&-
as the first endpoint of a range,
enclose it in
.Ql [.\&
and
.Ql .]\&
to make it a collating element (see below).
With the exception of these and some combinations using
.Ql \&[
(see next paragraphs), all other special characters, including
.Ql \e ,
lose their special significance within a bracket expression.
.Pp
Within a bracket expression, a collating element (a character,
a multi-character sequence that collates as if it were a single character,
or a collating-sequence name for either)
enclosed in
.Ql [.\&
and
.Ql .]\&
stands for the
sequence of characters of that collating element.
The sequence is a single element of the bracket expression's list.
A bracket expression containing a multi-character collating element
can thus match more than one character,
e.g.\& if the collating sequence includes a
.Ql ch
collating element,
then the RE
.Ql [[.ch.]]*c
matches the first five characters
of
.Ql chchcc .
.Pp
Within a bracket expression, a collating element enclosed in
.Ql [=
and
.Ql =]
is an equivalence class, standing for the sequences of characters
of all collating elements equivalent to that one, including itself.
(If there are no other equivalent collating elements,
the treatment is as if the enclosing delimiters were
.Ql [.\&
and
.Ql .] . )
For example, if
.Ql x
and
.Ql y
are the members of an equivalence class,
then
.Ql [[=x=]] ,
.Ql [[=y=]] ,
and
.Ql [xy]
are all synonymous.
An equivalence class may not\(dd be an endpoint
of a range.
.Pp
Within a bracket expression, the name of a
.Em character class
enclosed in
.Ql [:
and
.Ql :]
stands for the list of all characters belonging to that
class.
Standard character class names are:
.Bl -column "alnum" "digit" "xdigit" -offset indent
.It Em alnum Ta Em digit Ta Em punct
.It Em alpha Ta Em graph Ta Em space
.It Em blank Ta Em lower Ta Em upper
.It Em cntrl Ta Em print Ta Em xdigit
.El
.Pp
These stand for the character classes defined in
.Xr ctype 3 .
A locale may provide others.
A character class may not be used as an endpoint of a range.
.Pp
A bracket expression like
.Ql [[:class:]]
can be used to match a single character that belongs to a character
class.
For the reverse, matching any character that does not belong to a specific
class, the negation operator of bracket expressions may be used:
.Ql [^[:class:]] .
.Pp
There are two special cases\(dd of bracket expressions:
the bracket expressions
.Ql [[:<:]]
and
.Ql [[:>:]]
match the null string at the beginning and end of a word respectively.
A word is defined as a sequence of word characters
which is neither preceded nor followed by
word characters.
A word character is an
.Em alnum
character (as defined by
.Xr ctype 3 )
or an underscore.
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
.Pp
In the event that an RE could match more than one substring of a given
string,
the RE matches the one starting earliest in the string.
If the RE could match more than one substring starting at that point,
it matches the longest.
Subexpressions also match the longest possible substrings, subject to
the constraint that the whole match be as long as possible,
with subexpressions starting earlier in the RE taking priority over
ones starting later.
Note that higher-level subexpressions thus take priority over
their lower-level component subexpressions.
.Pp
Match lengths are measured in characters, not collating elements.
A null string is considered longer than no match at all.
For example,
.Ql bb*
matches the three middle characters of
.Ql abbbc ,
.Ql (wee|week)(knights|nights)
matches all ten characters of
.Ql weeknights ,
when
.Ql (.*).*\&
is matched against
.Ql abc
the parenthesized subexpression
matches all three characters, and
when
.Ql (a*)*
is matched against
.Ql bc
both the whole RE and the parenthesized
subexpression match the null string.
.Pp
If case-independent matching is specified,
the effect is much as if all case distinctions had vanished from the
alphabet.
When an alphabetic that exists in multiple cases appears as an
ordinary character outside a bracket expression, it is effectively
transformed into a bracket expression containing both cases,
.No e.g. Ql x
becomes
.Ql [xX] .
When it appears inside a bracket expression, all case counterparts
of it are added to the bracket expression, so that (e.g.)
.Ql [x]
becomes
.Ql [xX]
and
.Ql [^x]
becomes
.Ql [^xX] .
.Pp
No particular limit is imposed on the length of REs\(dd.
Programs intended to be portable should not employ REs longer
than 256 bytes,
as an implementation can refuse to accept such REs and remain
POSIX-compliant.
.Pp
Obsolete
.Pq Dq basic
regular expressions differ in several respects.
.Ql \&|
is an ordinary character and there is no equivalent
for its functionality.
.Ql \&+
and
.Ql ?\&
are ordinary characters, and their functionality
can be expressed using bounds
.No ( Ql {1,}
or
.Ql {0,1}
respectively).
Also note that
.Ql x+
in modern REs is equivalent to
.Ql xx* .
The delimiters for bounds are
.Ql \e{
and
.Ql \e} ,
with
.Ql \&{
and
.Ql \&}
by themselves ordinary characters.
The parentheses for nested subexpressions are
.Ql \e(
and
.Ql \e) ,
with
.Ql \&(
and
.Ql \&)
by themselves ordinary characters.
.Ql \&^
is an ordinary character except at the beginning of the
RE or\(dd the beginning of a parenthesized subexpression,
.Ql \&$
is an ordinary character except at the end of the
RE or\(dd the end of a parenthesized subexpression,
and
.Ql \&*
is an ordinary character if it appears at the beginning of the
RE or the beginning of a parenthesized subexpression
(after a possible leading
.Ql \&^ ) .
Finally, there is one new type of atom, a
.Em back reference :
.Ql \e
followed by a non-zero decimal digit
.Em d
matches the same sequence of characters
matched by the
.Em d Ns th
parenthesized subexpression
(numbering subexpressions by the positions of their opening parentheses,
left to right),
so that (e.g.)
.Ql \e([bc]\e)\e1
matches
.Ql bb
or
.Ql cc
but not
.Ql bc .
.Sh ENHANCED FEATURES
When the
.Dv REG_ENHANCED
flag is passed to one of the
.Fn regcomp
variants, additional features are activated.
Like the enhanced
.Nm regex
implementations in scripting languages such as
.Xr perl 1
and
.Xr python 1 ,
these additional features may conflict with the
.St -p1003.2
standards in some ways.
Use this with care in situations which require portability
(including to past versions of Mac OS X using the previous
.Nm regex
implementation).
.Pp
For enhanced basic REs,
.Ql \&+ ,
.Ql \&?
and
.Ql \&|
remain regular characters, but
.Ql \e+ ,
.Ql \e?
and
.Ql \e|
have the same special meaning as the unescaped characters do for
extended REs, i.e., one or more matches, zero or one matches and alternation,
respectively.
For enhanced extended REs,
back references are available.
Additional enhanced features are listed below.
.Pp
Within a bracket expression, most characters lose their magic.
This also applies to the additional enhanced features, which don't operate
inside a bracket expression.
.Ss Assertions (available for both enhanced basic and enhanced extended REs)
In addition to
.Ql \&^
and
.Ql \&$
(the assertions that match the null string at the beginning and end of line,
respectively), the following assertions become available:
.Bl -tag -width ".Sy \eB" -offset indent
.It Sy \e<
Matches the null string at the beginning of a word.
This is equivalent to
.Ql [[:<:]] .
.It Sy \e>
Matches the null string at the end of a word.
This is equivalent to
.Ql [[:>:]] .
.It Sy \eb
Matches the null string at a word boundary (either the beginning or end of
a word).
.It Sy \eB
Matches the null string where there is no word boundary.
This is the opposite of
.Ql \eb .
.El
.Ss Shortcuts (available for both enhanced basic and enhanced extended REs)
The following shortcuts can be used to replace more complicated
bracket expressions.
.Bl -tag -width ".Sy \eD" -offset indent
.It Sy \ed
Matches a digit character.
This is equivalent to
.Ql [[:digit:]] .
.It Sy \eD
Matches a non-digit character.
This is equivalent to
.Ql [^[:digit:]] .
.It Sy \es
Matches a space character.
This is equivalent to
.Ql [[:space:]] .
.It Sy \eS
Matches a non-space character.
This is equivalent to
.Ql [^[:space:]] .
.It Sy \ew
Matches a word character.
This is equivalent to
.Ql [[:alnum:]_] .
.It Sy \eW
Matches a non-word character.
This is equivalent to
.Ql [^[:alnum:]_] .
.El
.Ss Literal Sequences (available for both enhanced basic and enhanced extended REs)
Literals are normally just ordinary characters that are matched directly.
Under enhanced mode, certain character sequences are
converted to specific literals.
.Bl -tag -width ".Sy \ea" -offset indent
.It Sy \ea
The
.Dq bell
character (ASCII code 7).
.It Sy \ee
The
.Dq escape
character (ASCII code 27).
.It Sy \ef
The
.Dq form-feed
character (ASCII code 12).
.It Sy \en
The
.Dq new-line/line-feed
character (ASCII code 10).
.It Sy \er
The
.Dq carriage-return
character (ASCII code 13).
.It Sy \et
The
.Dq horizontal-tab
character (ASCII code 9).
.El
.Pp
Literals can also be specified directly, using their wide character values.
Note that when matching a multibyte character string, the string's bytes
are converted to wide characters before comparing.
This means that a single literal wide character value may match more than
one string byte, depending on the locale's wide character encoding.
.Bl -tag -width ".Sy \ex{ Ns Em x.. Ns Sy \&}" -offset indent
.It Sy \ex Ns Em x..
An arbitrary eight-bit value.
The
.Em x..
sequence represents zero, one or two hexadecimal digits.
(Note: if
.Em x..
is less than two hexadecimal digits, and the character following this sequence
happens to be a hexadecimal digit, use the (following) brace form to avoid
confusion.)
.It Sy \ex{ Ns Em x.. Ns Sy \&}
An arbitrary, up to 32-bit value.
The
.Em x..
sequence is an arbitrary sequence of hexadecimal digits that is long enough
to represent the necessary value.
.El
.Ss Inline Literal Mode (available for both enhanced basic and enhanced extended REs)
A
.Ql \eQ
sequence causes literal
.Pq Dq quote
mode to be entered,
while
.Ql \eE
ends literal mode, and returns to normal regular expression processing.
This is similar to specifying the
.Dv REG_NOSPEC
(or
.Dv REG_LITERAL )
option to
.Fn regcomp ,
except that rather than applying to the whole RE string, it only applies to
the part between the
.Ql \eQ
and
.Ql \eE .
Note that it is not possible to have a
.Ql \eE
in the middle of an inline literal range, as that would terminate literal mode
prematurely.
.Ss Minimal Repetitions (available for enhanced extended REs only)
By default, the repetition operators,
.Ql \&* ,
.Em bound ,
.Ql \&?
and
.Ql \&+
are
.Em greedy ;
they try to match as many times as possible.
In enhanced mode, appending a
.Ql \&?
to a repetition operator makes it minimal (or
.Em ungreedy ) ;
it tries to match the fewest number of times (including zero times, as
appropriate).
.Pp
For example, against the string
.Ql aaa ,
the RE
.Ql a*
would match the entire string,
while
.Ql a*?
would match the null string at the beginning of the line
(matches zero times).
Likewise, against the string
.Ql ababab ,
the RE
.Ql .*b
would also match the entire string,
while
.Ql .*?b
would only match the first two characters.
.Pp
The
.Fn regcomp
flag
.Dv REG_UNGREEDY
will make the regular
.Pq greedy
repetition operators ungreedy by default.
Appending
.Ql \&?
makes them greedy again.
.Pp
Note that minimal repetitions are not specified by an official
standard, so there may be differences between different implementations.
In the current implementation, minimal repetitions have a high precedence,
and can cause other standards requirements to be violated.
For instance, on the string
.Ql aaaaa ,
the RE
.Ql (aaa??)*
will only match the first four characters, violating the rules that the longest
possible match is made and the longest subexpressions are matched.
Using
.Ql (aaa??)*$
forces the entire string to be matched.
.Ss Non-capturing Parenthesized Subexpressions (available for enhanced extended REs only)
Normally, the match offsets to parenthesized subexpressions are
recorded in the
.Fa pmatch
array (that is, when
.Dv REG_NOSUB
is not specified, and
.Fa nmatch
is large enough to encompass the parenthesized subexpression in question).
In enhanced mode, if the first two characters following the left parenthesis
are
.Ql ?: ,
grouping of the remaining contents is done, but the corresponding offsets are
not recorded in the
.Fa pmatch
array.
For example, against the string
.Ql fubar ,
the RE
.Ql (fu)(bar)
would have two subexpression matches in
.Fa pmatch ;
the first for
.Ql fu
and the second for
.Ql bar .
But with the RE
.Ql (?:fu)(bar) ,
there would only be one subexpression match, that of
.Ql bar .
Furthermore,
against the string
.Ql fufubar ,
the RE
.Ql (?:fu)*(bar)
would again match the entire string, but only
.Ql bar
would be recorded in
.Fa pmatch .
.Ss Inline Options (available for enhanced extended REs only)
Like the inline literal mode mentioned above, other options can be switched
on and off for part of a RE.
.Ql (? Ns Em o.. Ns \&)
will turn on the options specified in
.Em o..
(one or more options characters; see below), while
.Ql (?- Ns Em o.. Ns \&)
will turn off the specified options, and
.Ql (? Ns Em o1.. Ns \&- Ns Em o2.. Ns \&)
will turn on the first set of options, and turn off the second set.
.Pp
The available options are:
.Bl -tag -width ".Sy \&U" -offset indent
.It Sy \&i
Turning on this option will ignore case during matching, while turning off
will restore case-sensitive matching.
If
.Dv REG_ICASE
was specified to
.Fn regcomp ,
this option can be used to turn that off.
.It Sy \&n
Turn on or off special handling of the newline character.
If
.Dv REG_NEWLINE
was specified to
.Fn regcomp ,
this option can be used to turn that off.
.It Sy \&U
Turning on this option will make ungreedy repetitions the default, while
turning off will make greedy repetitions the default.
If
.Dv REG_UNGREEDY
was specified to
.Fn regcomp ,
this option can be used to turn that off.
.El
.Pp
The scope of the option change begins immediately following the right
parenthesis,
and extends to the end of the enclosing subexpression (if any).
Thus, for example, given the RE
.Ql (fu(?i)bar)baz ,
the
.Ql fu
portion matches case sensitively,
.Ql bar
matches case insensitively, and
.Ql baz
matches case sensitively again (since it is outside the scope of the
subexpression in which the inline option was specified).
.Pp
The inline options syntax can be combined with the non-capturing parenthesized
subexpression to limit the option scope to just that of the subexpression.
Then, for example,
.Ql fu(?i:bar)baz
is similar to the previous example, except for the parenthesized subexpression
around
.Ql fu(?i)bar
in the previous example.
.Ss Inline Comments (available for enhanced extended REs only)
The syntax
.Ql (?# Ns Em comment Ns \&)
can be used to embed comments within a RE.
Note that
.Em comment
cannot contain a right parenthesis.
Also note that while syntactically, option characters can be added before
the
.Ql \&#
character, they will be ignored.
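.Sh EXAMPLES
The following C fragment is a minimal sketch of how the leftmost-longest
rule and bounds of modern REs can be observed through
.Xr regex 3 .
The helper name
.Fn match_span
is hypothetical and exists only for illustration.
It assumes nothing beyond
.Fn regcomp ,
.Fn regexec
and
.Fn regfree
with the
.Dv REG_EXTENDED
flag, so none of the enhanced features are exercised.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of any library): compile `pattern` as a
 * modern (extended) RE, match it against `text`, and store the byte
 * offsets of the whole match in *start and *end.
 * Returns 0 on a match, 1 on no match, -1 if the pattern does not compile.
 */
static int
match_span(const char *pattern, const char *text, int *start, int *end)
{
	regex_t re;
	regmatch_t m[1];
	int rc;

	assert(pattern != NULL && text != NULL);

	/* REG_EXTENDED selects modern REs; without it the RE is obsolete (basic). */
	if (regcomp(&re, pattern, REG_EXTENDED) != 0)
		return -1;

	/* m[0] receives the offsets of the whole match. */
	rc = regexec(&re, text, 1, m, 0);
	regfree(&re);
	if (rc != 0)
		return 1;

	*start = (int)m[0].rm_so;
	*end = (int)m[0].rm_eo;
	return 0;
}
```

With the pattern
.Ql bb*
and the text
.Ql abbbc ,
the helper should report offsets 1 and 4, the three middle characters.
With
.Ql a{2,3}
and the text
.Ql aaaa ,
it should report offsets 0 and 3, since the match starts as early as
possible and is then as long as the bound allows.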
.Sh SEE ALSO
.Xr regex 3
.Rs
.%T Regular Expression Notation
.%R IEEE Std
.%N 1003.2
.%P section 2.8
.Re
.Sh BUGS
Having two kinds of REs is a botch.
.Pp
The current
.St -p1003.2
spec says that
.Ql \&)
is an ordinary character in
the absence of an unmatched
.Ql \&( ;
this was an unintentional result of a wording error,
and change is likely.
Avoid relying on it.
.Pp
Back references are a dreadful botch,
posing major problems for efficient implementations.
They are also somewhat vaguely defined
(does
.Ql a\e(\e(b\e)*\e2\e)*d
match
.Ql abbbd ? ) .
Avoid using them.
.Pp
.St -p1003.2
specification of case-independent matching is vague.
The
.Dq one case implies all cases
definition given above
is current consensus among implementors as to the right interpretation.
.Pp
The bracket syntax for word boundaries is incredibly ugly.