1.\" $OpenBSD: regex.3,v 1.27 2016/05/26 05:46:44 martijn Exp $ 2.\" 3.\" Copyright (c) 1997, Phillip F Knaack. All rights reserved. 4.\" 5.\" Copyright (c) 1992, 1993, 1994 Henry Spencer. 6.\" Copyright (c) 1992, 1993, 1994 7.\" The Regents of the University of California. All rights reserved. 8.\" 9.\" This code is derived from software contributed to Berkeley by 10.\" Henry Spencer. 11.\" 12.\" Redistribution and use in source and binary forms, with or without 13.\" modification, are permitted provided that the following conditions 14.\" are met: 15.\" 1. Redistributions of source code must retain the above copyright 16.\" notice, this list of conditions and the following disclaimer. 17.\" 2. Redistributions in binary form must reproduce the above copyright 18.\" notice, this list of conditions and the following disclaimer in the 19.\" documentation and/or other materials provided with the distribution. 20.\" 3. Neither the name of the University nor the names of its contributors 21.\" may be used to endorse or promote products derived from this software 22.\" without specific prior written permission. 23.\" 24.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND 25.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 26.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE 27.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE 28.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 29.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS 30.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) 31.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT 32.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY 33.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF 34.\" SUCH DAMAGE. 35.\" 36.\" @(#)regex.3 8.4 (Berkeley) 3/20/94 37.\" 38.Dd $Mdocdate: May 26 2016 $ 39.Dt REGEXEC 3 40.Os 41.Sh NAME 42.Nm regcomp , 43.Nm regexec , 44.Nm regerror , 45.Nm regfree 46.Nd regular expression routines 47.Sh SYNOPSIS 48.In sys/types.h 49.In regex.h 50.Ft int 51.Fn regcomp "regex_t *preg" "const char *pattern" "int cflags" 52.Pp 53.Ft int 54.Fn regexec "const regex_t *preg" "const char *string" "size_t nmatch" \ 55 "regmatch_t pmatch[]" "int eflags" 56.Pp 57.Ft size_t 58.Fn regerror "int errcode" "const regex_t *preg" "char *errbuf" \ 59 "size_t errbuf_size" 60.Pp 61.Ft void 62.Fn regfree "regex_t *preg" 63.Sh DESCRIPTION 64These routines implement 65.St -p1003.2 66regular expressions 67.Pq Dq REs ; 68see 69.Xr re_format 7 . 70.Fn regcomp 71compiles an RE written as a string into an internal form, 72.Fn regexec 73matches that internal form against a string and reports results, 74.Fn regerror 75transforms error codes from either into human-readable messages, and 76.Fn regfree 77frees any dynamically allocated storage used by the internal form 78of an RE. 79.Pp 80The header 81.In regex.h 82declares two structure types, 83.Li regex_t 84and 85.Li regmatch_t , 86the former for compiled internal forms and the latter for match reporting. 87It also declares the four functions, 88a type 89.Li regoff_t , 90and a number of constants with names starting with 91.Dv REG_ . 92.Pp 93.Fn regcomp 94compiles the regular expression contained in the 95.Fa pattern 96string, 97subject to the flags in 98.Fa cflags , 99and places the results in the 100.Li regex_t 101structure pointed to by 102.Fa preg . 103The 104.Fa cflags 105argument is the bitwise 106.Tn OR 107of zero or more of the following values: 108.Bl -tag -width XREG_EXTENDEDX 109.It Dv REG_EXTENDED 110Compile modern 111.Pq Dq extended 112REs, 113rather than the obsolete 114.Pq Dq basic 115REs that are the default. 116.It Dv REG_BASIC 117This is a synonym for 0, 118provided as a counterpart to 119.Dv REG_EXTENDED 120to improve readability. 121.It Dv REG_NOSPEC 122Compile with recognition of all special characters turned off. 123All characters are thus considered ordinary, 124so the RE is a literal string. 125This is an extension, 126compatible with but not specified by 127.St -p1003.2 , 128and should be used with 129caution in software intended to be portable to other systems. 130.Dv REG_EXTENDED 131and 132.Dv REG_NOSPEC 133may not be used in the same call to 134.Fn regcomp . 135.It Dv REG_ICASE 136Compile for matching that ignores upper/lower case distinctions. 137See 138.Xr re_format 7 . 139.It Dv REG_NOSUB 140Compile for matching that need only report success or failure, 141not what was matched. 142.It Dv REG_NEWLINE 143Compile for newline-sensitive matching. 144By default, newline is a completely ordinary character with no special 145meaning in either REs or strings. 146With this flag, 147.Ql \&[^ 148bracket expressions and 149.Ql \&. 150never match newline, 151a 152.Ql ^ 153anchor matches the null string after any newline in the string 154in addition to its normal function, 155and the 156.Ql $ 157anchor matches the null string before any newline in the 158string in addition to its normal function. 159.It Dv REG_PEND 160The regular expression ends, 161not at the first NUL, 162but just before the character pointed to by the 163.Fa re_endp 164member of the structure pointed to by 165.Fa preg . 166The 167.Fa re_endp 168member is of type 169.Fa const\ char\ * . 170This flag permits inclusion of NULs in the RE; 171they are considered ordinary characters. 172This is an extension, 173compatible with but not specified by 174.St -p1003.2 , 175and should be used with 176caution in software intended to be portable to other systems. 177.El 178.Pp 179When successful, 180.Fn regcomp 181returns 0 and fills in the structure pointed to by 182.Fa preg . 183One member of that structure 184(other than 185.Fa re_endp ) 186is publicized: 187.Fa re_nsub , 188of type 189.Fa size_t , 190contains the number of parenthesized subexpressions within the RE 191(except that the value of this member is undefined if the 192.Dv REG_NOSUB 193flag was used). 194If 195.Fn regcomp 196fails, it returns a non-zero error code; 197see DIAGNOSTICS. 198.Pp 199.Fn regexec 200matches the compiled RE pointed to by 201.Fa preg 202against the 203.Fa string , 204subject to the flags in 205.Fa eflags , 206and reports results using 207.Fa nmatch , 208.Fa pmatch , 209and the returned value. 210The RE must have been compiled by a previous invocation of 211.Fn regcomp . 212The compiled form is not altered during execution of 213.Fn regexec , 214so a single compiled RE can be used simultaneously by multiple threads. 215.Pp 216By default, 217the NUL-terminated string pointed to by 218.Fa string 219is considered to be the text of an entire line, minus any terminating 220newline. 221The 222.Fa eflags 223argument is the bitwise 224.Tn OR 225of zero or more of the following values: 226.Bl -tag -width XREG_STARTENDX 227.It Dv REG_NOTBOL 228The first character of the string is treated as the continuation 229of a line. 230This means that the anchors 231.Ql ^ , 232.Ql [[:<:]] , 233and 234.Ql \e< 235do not match before it; but see 236.Dv REG_STARTEND 237below. 238This does not affect the behavior of newlines under 239.Dv REG_NEWLINE . 240.It Dv REG_NOTEOL 241The NUL terminating 242the string 243does not end a line, so the 244.Ql $ 245anchor does not match before it. 246This does not affect the behavior of newlines under 247.Dv REG_NEWLINE . 248.It Dv REG_STARTEND 249The string is considered to start at 250.Fa string No + 251.Fa pmatch Ns [0]. Ns Fa rm_so 252and to end before the byte located at 253.Fa string No + 254.Fa pmatch Ns [0]. Ns Fa rm_eo , 255regardless of the value of 256.Fa nmatch . 257See below for the definition of 258.Fa pmatch 259and 260.Fa nmatch . 261This is an extension, 262compatible with but not specified by 263.St -p1003.2 , 264and should be used with 265caution in software intended to be portable to other systems. 266.Pp 267Without 268.Dv REG_NOTBOL , 269the position 270.Fa rm_so 271is considered the beginning of a line, such that 272.Ql ^ 273matches before it, and the beginning of a word if there is a word 274character at this position, such that 275.Ql [[:<:]] 276and 277.Ql \e< 278match before it. 279.Pp 280With 281.Dv REG_NOTBOL , 282the character at position 283.Fa rm_so 284is treated as the continuation of a line, and if 285.Fa rm_so 286is greater than 0, the preceding character is taken into consideration. 287If the preceding character is a newline and the regular expression was compiled 288with 289.Dv REG_NEWLINE , 290.Ql ^ 291matches before the string; if the preceding character is not a word character 292but the string starts with a word character, 293.Ql [[:<:]] 294and 295.Ql \e< 296match before the string. 297.El 298.Pp 299See 300.Xr re_format 7 301for a discussion of what is matched in situations where an RE or a 302portion thereof could match any of several substrings of 303.Fa string . 304.Pp 305Normally, 306.Fn regexec 307returns 0 for success and the non-zero code 308.Dv REG_NOMATCH 309for failure. 310Other non-zero error codes may be returned in exceptional situations; 311see DIAGNOSTICS. 312.Pp 313If 314.Dv REG_NOSUB 315was specified in the compilation of the RE, 316or if 317.Fa nmatch 318is 0, 319.Fn regexec 320ignores the 321.Fa pmatch 322argument (but see below for the case where 323.Dv REG_STARTEND 324is specified). 325Otherwise, 326.Fa pmatch 327points to an array of 328.Fa nmatch 329structures of type 330.Li regmatch_t . 331Such a structure has at least the members 332.Fa rm_so 333and 334.Fa rm_eo , 335both of type 336.Fa regoff_t 337(a signed arithmetic type at least as large as an 338.Li off_t 339and a 340.Li ssize_t ) , 341containing respectively the offset of the first character of a substring 342and the offset of the first character after the end of the substring. 343Offsets are measured from the beginning of the 344.Fa string 345argument given to 346.Fn regexec . 347An empty substring is denoted by equal offsets, 348both indicating the character following the empty substring. 349.Pp 350The 0th member of the 351.Fa pmatch 352array is filled in to indicate what substring of 353.Fa string 354was matched by the entire RE. 355Remaining members report what substring was matched by parenthesized 356subexpressions within the RE; 357member 358.Va i 359reports subexpression 360.Va i , 361with subexpressions counted (starting at 1) by the order of their opening 362parentheses in the RE, left to right. 363Unused entries in the array\(emcorresponding either to subexpressions that 364did not participate in the match at all, or to subexpressions that do not 365exist in the RE (that is, \fIi\fR\ > \fIpreg\fR\->\fIre_nsub\fR)\(emhave both 366.Fa rm_so 367and 368.Fa rm_eo 369set to \-1. 370If a subexpression participated in the match several times, 371the reported substring is the last one it matched. 372(Note, as an example in particular, that when the RE 373.Dq (b*)+ 374matches 375.Dq bbb , 376the parenthesized subexpression matches each of the three 377.Sq b Ns s 378and then 379an infinite number of empty strings following the last 380.Sq b , 381so the reported substring is one of the empties.) 382.Pp 383If 384.Dv REG_STARTEND 385is specified, 386.Fa pmatch 387must point to at least one 388.Li regmatch_t 389(even if 390.Fa nmatch 391is 0 or 392.Dv REG_NOSUB 393was specified), 394to hold the input offsets for 395.Dv REG_STARTEND . 396Use for output is still entirely controlled by 397.Fa nmatch ; 398if 399.Fa nmatch 400is 0 or 401.Dv REG_NOSUB 402was specified, 403the value of 404.Fa pmatch[0] 405will not be changed by a successful 406.Fn regexec . 407.Pp 408.Fn regerror 409maps a non-zero 410.Va errcode 411from either 412.Fn regcomp 413or 414.Fn regexec 415to a human-readable, printable message. 416If 417.Fa preg 418is non-NULL, 419the error code should have arisen from use of 420the 421.Li regex_t 422pointed to by 423.Fa preg , 424and if the error code came from 425.Fn regcomp , 426it should have been the result from the most recent 427.Fn regcomp 428using that 429.Li regex_t . 430.Pf ( Fn regerror 431may be able to supply a more detailed message using information 432from the 433.Li regex_t . ) 434.Fn regerror 435places the NUL-terminated message into the buffer pointed to by 436.Fa errbuf , 437limiting the length (including the NUL) to at most 438.Fa errbuf_size 439bytes. 440If the whole message won't fit, 441as much of it as will fit before the terminating NUL is supplied. 442In any case, 443the returned value is the size of buffer needed to hold the whole 444message (including the terminating NUL). 445If 446.Fa errbuf_size 447is 0, 448.Fa errbuf 449is ignored but the return value is still correct. 450.Pp 451If the 452.Fa errcode 453given to 454.Fn regerror 455is first 456.Tn OR Ns 'ed 457with 458.Dv REG_ITOA , 459the 460.Dq message 461that results is the printable name of the error code, 462e.g., 463.Dq REG_NOMATCH , 464rather than an explanation thereof. 465If 466.Fa errcode 467is 468.Dv REG_ATOI , 469then 470.Fa preg 471shall be non-null and the 472.Fa re_endp 473member of the structure it points to 474must point to the printable name of an error code; 475in this case, the result in 476.Fa errbuf 477is the decimal digits of 478the numeric value of the error code 479(0 if the name is not recognized). 480.Dv REG_ITOA 481and 482.Dv REG_ATOI 483are intended primarily as debugging facilities; 484they are extensions, 485compatible with but not specified by 486.St -p1003.2 487and should be used with 488caution in software intended to be portable to other systems. 489Be warned also that they are considered experimental and changes are possible. 490.Pp 491.Fn regfree 492frees any dynamically allocated storage associated with the compiled RE 493pointed to by 494.Fa preg . 495The remaining 496.Li regex_t 497is no longer a valid compiled RE 498and the effect of supplying it to 499.Fn regexec 500or 501.Fn regerror 502is undefined. 503.Pp 504None of these functions references global variables except for tables 505of constants; 506all are safe for use from multiple threads if the arguments are safe. 507.Sh IMPLEMENTATION CHOICES 508There are a number of decisions that 509.St -p1003.2 510leaves up to the implementor, 511either by explicitly saying 512.Dq undefined 513or by virtue of them being 514forbidden by the RE grammar. 515This implementation treats them as follows. 516.Pp 517See 518.Xr re_format 7 519for a discussion of the definition of case-independent matching. 520.Pp 521There is no particular limit on the length of REs, 522except insofar as memory is limited. 523Memory usage is approximately linear in RE size, and largely insensitive 524to RE complexity, except for bounded repetitions. 525See 526.Sx BUGS 527for one short RE using them 528that will run almost any system out of memory. 529.Pp 530A backslashed character other than one specifically given a magic meaning 531by 532.St -p1003.2 533(such magic meanings occur only in obsolete REs) 534is taken as an ordinary character. 535.Pp 536Any unmatched 537.Ql \&[ 538is a 539.Dv REG_EBRACK 540error. 541.Pp 542Equivalence classes cannot begin or end bracket-expression ranges. 543The endpoint of one range cannot begin another. 544.Pp 545RE_DUP_MAX, the limit on repetition counts in bounded repetitions, is 255. 546.Pp 547A repetition operator (?, *, +, or bounds) cannot follow another 548repetition operator. 549A repetition operator cannot begin an expression or subexpression 550or follow 551.Ql ^ 552or 553.Ql | . 554.Pp 555A 556.Ql | 557cannot appear first or last in a (sub)expression, or after another 558.Ql | , 559i.e., an operand of 560.Ql | 561cannot be an empty subexpression. 562An empty parenthesized subexpression, 563.Ql \&(\&) , 564is legal and matches an 565empty (sub)string. 566An empty string is not a legal RE. 567.Pp 568A 569.Ql { 570followed by a digit is considered the beginning of bounds for a 571bounded repetition, which must then follow the syntax for bounds. 572A 573.Ql { 574.Em not 575followed by a digit is considered an ordinary character. 576.Pp 577.Ql ^ 578and 579.Ql $ 580beginning and ending subexpressions in obsolete 581.Pq Dq basic 582REs are anchors, not ordinary characters. 583.Sh DIAGNOSTICS 584Non-zero error codes from 585.Fn regcomp 586and 587.Fn regexec 588include the following: 589.Pp 590.Bl -tag -compact -width XREG_ECOLLATEX 591.It Er REG_NOMATCH 592regexec() failed to match 593.It Er REG_BADPAT 594invalid regular expression 595.It Er REG_ECOLLATE 596invalid collating element 597.It Er REG_ECTYPE 598invalid character class 599.It Er REG_EESCAPE 600\e applied to unescapable character 601.It Er REG_ESUBREG 602invalid backreference number 603.It Er REG_EBRACK 604brackets [ ] not balanced 605.It Er REG_EPAREN 606parentheses ( ) not balanced 607.It Er REG_EBRACE 608braces { } not balanced 609.It Er REG_BADBR 610invalid repetition count(s) in { } 611.It Er REG_ERANGE 612invalid character range in [ ] 613.It Er REG_ESPACE 614ran out of memory 615.It Er REG_BADRPT 616?, *, or + operand invalid 617.It Er REG_EMPTY 618empty (sub)expression 619.It Er REG_ASSERT 620.Dq can't happen 621\(emyou found a bug 622.It Er REG_INVARG 623invalid argument, e.g., negative-length string 624.El 625.Sh SEE ALSO 626.Xr grep 1 , 627.Xr re_format 7 628.Pp 629.St -p1003.2 , 630sections 2.8 (Regular Expression Notation) 631and 632B.5 (C Binding for Regular Expression Matching). 633.Sh HISTORY 634Predecessors to 635.Fn regcomp 636and 637.Fn regexec , 638.Fn regcmp 639and 640.Fn regex , 641first appeared in the Programmer's Workbench (PWB/UNIX). 642.Pp 643The present functions were originally written by Henry Spencer 644and altered for inclusion in the 645.Bx 4.4 646distribution. 647.Sh BUGS 648This is an alpha release with known defects. 649Please report problems. 650.Pp 651There is one known functionality bug. 652The implementation of internationalization is incomplete: 653the locale is always assumed to be the default one of 654.St -p1003.2 , 655and only the collating elements etc. of that locale are available. 656.Pp 657The back-reference code is subtle and doubts linger about its correctness 658in complex cases. 659.Pp 660.Fn regexec 661performance is poor. 662This will improve with later releases. 663.Fa nmatch 664exceeding 0 is expensive; 665.Fa nmatch 666exceeding 1 is worse. 667.Fn regexec 668is largely insensitive to RE complexity 669.Em except 670that back references are massively expensive. 671RE length does matter; in particular, there is a strong speed bonus 672for keeping RE length under about 30 characters, 673with most special characters counting roughly double. 674.Pp 675.Fn regcomp 676implements bounded repetitions by macro expansion, 677which is costly in time and space if counts are large 678or bounded repetitions are nested. 679A RE like, say, 680.Dq ((((a{1,100}){1,100}){1,100}){1,100}){1,100} 681will (eventually) run almost any existing machine out of swap space. 682.Pp 683There are suspected problems with response to obscure error conditions. 684Notably, 685certain kinds of internal overflow, 686produced only by truly enormous REs or by multiply nested bounded repetitions, 687are probably not handled well. 688.Pp 689Due to a mistake in 690.St -p1003.2 , 691things like 692.Ql a)b 693are legal REs because 694.Ql \&) 695is 696a special character only in the presence of a previous unmatched 697.Ql \&( . 698This can't be fixed until the spec is fixed. 699.Pp 700The standard's definition of back references is vague. 701For example, does 702.Dq a\e(\e(b\e)*\e2\e)*d 703match 704.Dq abbbd ? 705Until the standard is clarified, 706behavior in such cases should not be relied on. 707.Pp 708The implementation of word-boundary matching is a bit of a kludge, 709and bugs may lurk in combinations of word-boundary matching and anchoring. 710