1.\" $NetBSD: regex.3,v 1.14 2002/10/01 17:06:53 wiz Exp $ 2.\" 3.\" Copyright (c) 1992, 1993, 1994 Henry Spencer. 4.\" Copyright (c) 1992, 1993, 1994 5.\" The Regents of the University of California. All rights reserved. 6.\" 7.\" This code is derived from software contributed to Berkeley by 8.\" Henry Spencer. 9.\" 10.\" Redistribution and use in source and binary forms, with or without 11.\" modification, are permitted provided that the following conditions 12.\" are met: 13.\" 1. Redistributions of source code must retain the above copyright 14.\" notice, this list of conditions and the following disclaimer. 15.\" 2. Redistributions in binary form must reproduce the above copyright 16.\" notice, this list of conditions and the following disclaimer in the 17.\" documentation and/or other materials provided with the distribution. 18.\" 3. All advertising materials mentioning features or use of this software 19.\" must display the following acknowledgement: 20.\" This product includes software developed by the University of 21.\" California, Berkeley and its contributors. 22.\" 4. Neither the name of the University nor the names of its contributors 23.\" may be used to endorse or promote products derived from this software 24.\" without specific prior written permission. 25.\" 26.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND 27.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 28.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE 29.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE 30.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 31.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS 32.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) 33.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT 34.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY 35.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF 36.\" SUCH DAMAGE. 37.\" 38.\" @(#)regex.3 8.4 (Berkeley) 3/20/94 39.\" 40.Dd March 20, 1994 41.Dt REGEX 3 42.Os 43.Sh NAME 44.Nm regex , 45.Nm regcomp , 46.Nm regexec , 47.Nm regerror , 48.Nm regfree 49.Nd regular-expression library 50.Sh LIBRARY 51.Lb libc 52.Sh SYNOPSIS 53.Fd #include \*[Lt]regex.h\*[Gt] 54.Ft int 55.Fn regcomp "regex_t *preg" "const char *pattern" "int cflags" 56.Ft int 57.Fn regexec "const regex_t *preg" "const char *string" "size_t nmatch" "regmatch_t pmatch[]" "int eflags" 58.Ft size_t 59.Fn regerror "int errcode" "const regex_t *preg" "char *errbuf" "size_t errbuf_size" 60.Ft void 61.Fn regfree "regex_t *preg" 62.Sh DESCRIPTION 63These routines implement 64.St -p1003.2-92 65regular expressions (``RE''s); 66see 67.Xr re_format 7 . 68.Fn regcomp 69compiles an RE written as a string into an internal form, 70.Fn regexec 71matches that internal form against a string and reports results, 72.Fn regerror 73transforms error codes from either into human-readable messages, 74and 75.Fn regfree 76frees any dynamically-allocated storage used by the internal form 77of an RE. 78.Pp 79The header 80.Em \*[Lt]regex.h\*[Gt] 81declares two structure types, 82.Fa regex_t 83and 84.Fa regmatch_t , 85the former for compiled internal forms and the latter for match reporting. 86It also declares the four functions, 87a type 88.Fa regoff_t , 89and a number of constants with names starting with ``REG_''. 90.Pp 91.Fn regcomp 92compiles the regular expression contained in the 93.Fa pattern 94string, 95subject to the flags in 96.Fa cflags , 97and places the results in the 98.Fa regex_t 99structure pointed to by 100.Fa preg . 101.Fa cflags 102is the bitwise OR of zero or more of the following flags: 103.Bl -tag -width XXXREG_EXTENDED 104.It Dv REG_EXTENDED 105Compile modern (``extended'') REs, rather than the obsolete 106(``basic'') REs that are the default. 107.It Dv REG_BASIC 108This is a synonym for 0, 109provided as a counterpart to REG_EXTENDED to improve readability. 110.It Dv REG_NOSPEC 111Compile with recognition of all special characters turned off. 112All characters are thus considered ordinary, so the ``RE'' is a literal 113string. 114This is an extension, compatible with but not specified by 115.St -p1003.2-92 , 116and should be used with caution in software intended to be portable to 117other systems. 118.Dv REG_EXTENDED 119and 120.Dv REG_NOSPEC 121may not be used in the same call to 122.Fn regcomp . 123.It Dv REG_ICASE 124Compile for matching that ignores upper/lower case distinctions. 125See 126.Xr re_format 7 . 127.It Dv REG_NOSUB 128Compile for matching that need only report success or failure, not 129what was matched. 130.It Dv REG_NEWLINE 131Compile for newline-sensitive matching. 132By default, newline is a completely ordinary character with no special 133meaning in either REs or strings. 134With this flag, 135`[^' bracket expressions and `.' never match newline, 136a `^' anchor matches the null string after any newline in the string 137in addition to its normal function, 138and the `$' anchor matches the null string before any newline in the 139string in addition to its normal function. 140.It Dv REG_PEND 141The regular expression ends, not at the first NUL, but just before the 142character pointed to by the 143.Fa re_endp 144member of the structure pointed to by 145.Fa preg . 146The 147.Fa re_endp 148member is of type 149.Fa "const\ char\ *" . 150This flag permits inclusion of NULs in the RE; they are considered 151ordinary characters. 152This is an extension, compatible with but not specified by 153.St -p1003.2-92 , 154and should be used with caution in software intended to be portable to 155other systems. 156.El 157.Pp 158When successful, 159.Fn regcomp 160returns 0 and fills in the structure pointed to by 161.Fa preg . 162One member of that structure (other than 163.Fa re_endp ) 164is publicized: 165.Fa re_nsub , 166of type 167.Fa size_t , 168contains the number of parenthesized subexpressions within the RE 169(except that the value of this member is undefined if the 170.Dv REG_NOSUB 171flag was used). 172If 173.Fn regcomp 174fails, it returns a non-zero error code; 175see 176.Sx DIAGNOSTICS . 177.Pp 178.Fn regexec 179matches the compiled RE pointed to by 180.Fa preg 181against the 182.Fa string , 183subject to the flags in 184.Fa eflags , 185and reports results using 186.Fa nmatch , 187.Fa pmatch , 188and the returned value. 189The RE must have been compiled by a previous invocation of 190.Fn regcomp . 191The compiled form is not altered during execution of 192.Fn regexec , 193so a single compiled RE can be used simultaneously by multiple threads. 194.Pp 195By default, 196the NUL-terminated string pointed to by 197.Fa string 198is considered to be the text of an entire line, minus any terminating 199newline. 200The 201.Fa eflags 202argument is the bitwise OR of zero or more of the following flags: 203.Bl -tag -width XXXREG_NOTBOL 204.It Dv REG_NOTBOL 205The first character of the string 206is not the beginning of a line, so the `^' anchor should not match before it. 207This does not affect the behavior of newlines under 208.Dv REG_NEWLINE . 209.It Dv REG_NOTEOL 210The NUL terminating the string does not end a line, so the `$' anchor 211should not match before it. 212This does not affect the behavior of newlines under 213.Dv REG_NEWLINE . 214.It Dv REG_STARTEND 215The string is considered to start at 216.Fa string 217+ 218.Fa pmatch[0].rm_so 219and to have a terminating NUL located at 220.Fa string 221+ 222.Fa pmatch[0].rm_eo 223(there need not actually be a NUL at that location), 224regardless of the value of 225.Fa nmatch . 226See below for the definition of 227.Fa pmatch 228and 229.Fa nmatch . 230This is an extension, compatible with but not specified by 231.St -p1003.2-92 , 232and should be used with caution in software intended to be portable to 233other systems. 234Note that a non-zero 235.Fa rm_so 236does not imply 237.Dv REG_NOTBOL ; 238.Dv REG_STARTEND 239affects only the location of the string, not how it is matched. 240.El 241.Pp 242See 243.Xr re_format 7 244for a discussion of what is matched in situations where an RE or a 245portion thereof could match any of several substrings of 246.Fa string . 247.Pp 248Normally, 249.Fn regexec 250returns 0 for success and the non-zero code 251.Dv REG_NOMATCH 252for failure. 253Other non-zero error codes may be returned in exceptional situations; 254see 255.Sx DIAGNOSTICS . 256.Pp 257If 258.Dv REG_NOSUB 259was specified in the compilation of the RE, or if 260.Fa nmatch 261is 0, 262.Fn regexec 263ignores the 264.Fa pmatch 265argument (but see below for the case where 266.Dv REG_STARTEND 267is specified). 268Otherwise, 269.Fa pmatch 270points to an array of 271.Fa nmatch 272structures of type 273.Fa regmatch_t . 274Such a structure has at least the members 275.Fa rm_so 276and 277.Fa rm_eo , 278both of type 279.Fa regoff_t 280(a signed arithmetic type at least as large as an 281.Fa off_t 282and a 283.Fa ssize_t ) , 284containing respectively the offset of the first character of a substring 285and the offset of the first character after the end of the substring. 286Offsets are measured from the beginning of the 287.Fa string 288argument given to 289.Fn regexec . 290An empty substring is denoted by equal offsets, 291both indicating the character following the empty substring. 292.Pp 293The 0th member of the 294.Fa pmatch 295array is filled in to indicate what substring of 296.Fa string 297was matched by the entire RE. 298Remaining members report what substring was matched by parenthesized 299subexpressions within the RE; 300member 301.Fa i 302reports subexpression 303.Fa i , 304with subexpressions counted (starting at 1) by the order of their 305opening parentheses in the RE, left to right. 306Unused entries in the array\(emcorresponding either to subexpressions that 307did not participate in the match at all, or to subexpressions that do not 308exist in the RE (that is, 309.Fa i 310\*[Gt] 311.Fa preg-\*[Gt]re_nsub ) 312\(emhave both 313.Fa rm_so 314and 315.Fa rm_eo 316set to -1. 317If a subexpression participated in the match several times, 318the reported substring is the last one it matched. 319(Note, as an example in particular, that when the RE `(b*)+' matches `bbb', 320the parenthesized subexpression matches each of the three `b's and then 321an infinite number of empty strings following the last `b', 322so the reported substring is one of the empties.) 323.Pp 324If 325.Dv REG_STARTEND 326is specified, 327.Fa pmatch 328must point to at least one 329.Fa regmatch_t 330(even if 331.Fa nmatch 332is 0 or 333.Dv REG_NOSUB 334was specified), 335to hold the input offsets for 336.Dv REG_STARTEND . 337Use for output is still entirely controlled by 338.Fa nmatch ; 339if 340.Fa nmatch 341is 0 or 342.Dv REG_NOSUB 343was specified, 344the value of 345.Fa pmatch [0] 346will not be changed by a successful 347.Fn regexec . 348.Pp 349.Fn regerror 350maps a non-zero 351.Fa errcode 352from either 353.Fn regcomp 354or 355.Fn regexec 356to a human-readable, printable message. 357If 358.Fa preg 359is non-NULL, 360the error code should have arisen from use of the 361.Fa regex_t 362pointed to by 363.Fa preg , 364and if the error code came from 365.Fn regcomp , 366it should have been the result from the most recent 367.Fn regcomp 368using that 369.Fa regex_t . ( 370.Fn regerror 371may be able to supply a more detailed message using information 372from the 373.Fa regex_t . ) 374.Fn regerror 375places the NUL-terminated message into the buffer pointed to by 376.Fa errbuf , 377limiting the length (including the NUL) to at most 378.Fa errbuf_size 379bytes. 380If the whole message won't fit, 381as much of it as will fit before the terminating NUL is supplied. 382In any case, 383the returned value is the size of buffer needed to hold the whole 384message (including terminating NUL). 385If 386.Fa errbuf_size 387is 0, 388.Fa errbuf 389is ignored but the return value is still correct. 390.Pp 391If the 392.Fa errcode 393given to 394.Fn regerror 395is first ORed with 396.Dv REG_ITOA , 397the ``message'' that results is the printable name of the error code, 398e.g. ``REG_NOMATCH'', 399rather than an explanation thereof. 400If 401.Fa errcode 402is 403.Dv REG_ATOI , 404then 405.Fa preg 406shall be non-NULL and the 407.Fa re_endp 408member of the structure it points to 409must point to the printable name of an error code; 410in this case, the result in 411.Fa errbuf 412is the decimal digits of 413the numeric value of the error code 414(0 if the name is not recognized). 415.Dv REG_ITOA 416and 417.Dv REG_ATOI 418are intended primarily as debugging facilities; 419they are extensions, compatible with but not specified by 420.St -p1003.2-92 , 421and should be used with caution in software intended to be portable to 422other systems. 423Be warned also that they are considered experimental and changes are possible. 424.Pp 425.Fn regfree 426frees any dynamically-allocated storage associated with the compiled RE 427pointed to by 428.Fa preg . 429The remaining 430.Fa regex_t 431is no longer a valid compiled RE 432and the effect of supplying it to 433.Fn regexec 434or 435.Fn regerror 436is undefined. 437.Pp 438None of these functions references global variables except for tables 439of constants; 440all are safe for use from multiple threads if the arguments are safe. 441.Sh IMPLEMENTATION CHOICES 442There are a number of decisions that 443.St -p1003.2-92 444leaves up to the implementor, 445either by explicitly saying ``undefined'' or by virtue of them being 446forbidden by the RE grammar. 447This implementation treats them as follows. 448.Pp 449See 450.Xr re_format 7 451for a discussion of the definition of case-independent matching. 452.Pp 453There is no particular limit on the length of REs, 454except insofar as memory is limited. 455Memory usage is approximately linear in RE size, and largely insensitive 456to RE complexity, except for bounded repetitions. 457See BUGS for one short RE using them 458that will run almost any system out of memory. 459.Pp 460A backslashed character other than one specifically given a magic meaning 461by 462.St -p1003.2-92 463(such magic meanings occur only in obsolete [``basic''] REs) 464is taken as an ordinary character. 465.Pp 466Any unmatched [ is a 467.Dv REG_EBRACK 468error. 469.Pp 470Equivalence classes cannot begin or end bracket-expression ranges. 471The endpoint of one range cannot begin another. 472.Pp 473.Dv RE_DUP_MAX , 474the limit on repetition counts in bounded repetitions, is 255. 475.Pp 476A repetition operator (?, *, +, or bounds) cannot follow another 477repetition operator. 478A repetition operator cannot begin an expression or subexpression 479or follow `^' or `|'. 480.Pp 481`|' cannot appear first or last in a (sub)expression or after another `|', 482i.e. an operand of `|' cannot be an empty subexpression. 483An empty parenthesized subexpression, `()', is legal and matches an 484empty (sub)string. 485An empty string is not a legal RE. 486.Pp 487A `{' followed by a digit is considered the beginning of bounds for a 488bounded repetition, which must then follow the syntax for bounds. 489A `{' \fInot\fR followed by a digit is considered an ordinary character. 490.Pp 491`^' and `$' beginning and ending subexpressions in obsolete (``basic'') 492REs are anchors, not ordinary characters. 493.Sh DIAGNOSTICS 494Non-zero error codes from 495.Fn regcomp 496and 497.Fn regexec 498include the following: 499.Pp 500.Bl -tag -width XXXREG_ECOLLATE -compact 501.It Dv REG_NOMATCH 502regexec() failed to match 503.It Dv REG_BADPAT 504invalid regular expression 505.It Dv REG_ECOLLATE 506invalid collating element 507.It Dv REG_ECTYPE 508invalid character class 509.It Dv REG_EESCAPE 510\e applied to unescapable character 511.It Dv REG_ESUBREG 512invalid backreference number 513.It Dv REG_EBRACK 514brackets [ ] not balanced 515.It Dv REG_EPAREN 516parentheses ( ) not balanced 517.It Dv REG_EBRACE 518braces { } not balanced 519.It Dv REG_BADBR 520invalid repetition count(s) in { } 521.It Dv REG_ERANGE 522invalid character range in [ ] 523.It Dv REG_ESPACE 524ran out of memory 525.It Dv REG_BADRPT 526?, *, or + operand invalid 527.It Dv REG_EMPTY 528empty (sub)expression 529.It Dv REG_ASSERT 530``can't happen''\(emyou found a bug 531.It Dv REG_INVARG 532invalid argument, e.g. negative-length string 533.El 534.Sh SEE ALSO 535.Xr grep 1 , 536.Xr sed 1 , 537.Xr re_format 7 538.Pp 539.St -p1003.2-92 , 540sections 2.8 (Regular Expression Notation) 541and 542B.5 (C Binding for Regular Expression Matching). 543.Sh HISTORY 544Originally written by Henry Spencer. 545Altered for inclusion in the 546.Bx 4.4 547distribution. 548.Sh BUGS 549This is an alpha release with known defects. 550Please report problems. 551.Pp 552There is one known functionality bug. 553The implementation of internationalization is incomplete: 554the locale is always assumed to be the default one of 555.St -p1003.2-92 , 556and only the collating elements etc. of that locale are available. 557.Pp 558The back-reference code is subtle and doubts linger about its correctness 559in complex cases. 560.Pp 561.Fn regexec 562performance is poor. 563This will improve with later releases. 564.Fa nmatch 565exceeding 0 is expensive; 566.Fa nmatch 567exceeding 1 is worse. 568.Fa regexec 569is largely insensitive to RE complexity 570.Em except 571that back references are massively expensive. 572RE length does matter; in particular, there is a strong speed bonus 573for keeping RE length under about 30 characters, 574with most special characters counting roughly double. 575.Pp 576.Fn regcomp 577implements bounded repetitions by macro expansion, 578which is costly in time and space if counts are large 579or bounded repetitions are nested. 580An RE like, say, 581`((((a{1,100}){1,100}){1,100}){1,100}){1,100}' 582will (eventually) run almost any existing machine out of swap space. 583.Pp 584There are suspected problems with response to obscure error conditions. 585Notably, 586certain kinds of internal overflow, 587produced only by truly enormous REs or by multiply nested bounded repetitions, 588are probably not handled well. 589.Pp 590Due to a mistake in 591.St -p1003.2-92 , 592things like `a)b' are legal REs because `)' is a special character 593only in the presence of a previous unmatched `('. 594This can't be fixed until the spec is fixed. 595.Pp 596The standard's definition of back references is vague. 597For example, does 598`a\e(\e(b\e)*\e2\e)*d' match `abbbd'? 599Until the standard is clarified, behavior in such cases should not be 600relied on. 601.Pp 602The implementation of word-boundary matching is a bit of a kludge, 603and bugs may lurk in combinations of word-boundary matching and anchoring. 604