1.\" Copyright 1991 The Regents of the University of California. 2.\" All rights reserved. 3.\" 4.\" %sccs.include.redist.man% 5.\" 6.\" @(#)regexp.3 5.2 (Berkeley) 04/20/91 7.\" 8.Dd 9.Dt REGEXP 3 10.Os 11.Sh NAME 12.Nm regcomp , 13.Nm regexec , 14.Nm regsub , 15.Nm regerror 16.Nd regular expression handlers 17.Sh SYNOPSIS 18.Fd #include <regexp.h> 19.Ft regexp * 20.Fn regcomp "const char *exp" 21.Ft int 22.Fn regexec "const regexp *prog" "const char *string" 23.Ft void 24.Fn regsub "const regexp *prog" "const char *source" "char *dest" 25.Sh DESCRIPTION 26The 27.Fn regcomp , 28.Fn regexec , 29.Fn regsub , 30and 31.Fn regerror 32functions 33implement 34.Xr egrep 1 Ns -style 35regular expressions and supporting facilities. 36.Pp 37The 38.Fn regcomp 39function 40compiles a regular expression into a structure of type 41.Xr regexp , 42and returns a pointer to it. 43The space has been allocated using 44.Xr malloc 3 45and may be released by 46.Xr free . 47.Pp 48The 49.Fn regexec 50function 51matches a 52.Dv NUL Ns -terminated 53.Fa string 54against the compiled regular expression 55in 56.Fa prog . 57It returns 1 for success and 0 for failure, and adjusts the contents of 58.Fa prog Ns 's 59.Em startp 60and 61.Em endp 62(see below) accordingly. 63.Pp 64The members of a 65.Xr regexp 66structure include at least the following (not necessarily in order): 67.Bd -literal -offset indent 68char *startp[NSUBEXP]; 69char *endp[NSUBEXP]; 70.Ed 71.Pp 72where 73.Dv NSUBEXP 74is defined (as 10) in the header file. 75Once a successful 76.Fn regexec 77has been done using the 78.Fn regexp , 79each 80.Em startp Ns - Em endp 81pair describes one substring 82within the 83.Fa string , 84with the 85.Em startp 86pointing to the first character of the substring and 87the 88.Em endp 89pointing to the first character following the substring. 90The 0th substring is the substring of 91.Fa string 92that matched the whole 93regular expression. 94The others are those substrings that matched parenthesized expressions 95within the regular expression, with parenthesized expressions numbered 96in left-to-right order of their opening parentheses. 97.Pp 98The 99.Fn regsub 100function 101copies 102.Fa source 103to 104.Fa dest , 105making substitutions according to the 106most recent 107.Fn regexec 108performed using 109.Fa prog . 110Each instance of `&' in 111.Fa source 112is replaced by the substring 113indicated by 114.Em startp Ns Bq 115and 116.Em endp Ns Bq . 117Each instance of 118.Sq \e Ns Em n , 119where 120.Em n 121is a digit, is replaced by 122the substring indicated by 123.Em startp Ns Bq Em n 124and 125.Em endp Ns Bq Em n . 126To get a literal `&' or 127.Sq \e Ns Em n 128into 129.Fa dest , 130prefix it with `\e'; 131to get a literal `\e' preceding `&' or 132.Sq \e Ns Em n , 133prefix it with 134another `\e'. 135.Pp 136The 137.Fn regerror 138function 139is called whenever an error is detected in 140.Fn regcomp , 141.Fn regexec , 142or 143.Fn regsub . 144The default 145.Fn regerror 146writes the string 147.Fa msg , 148with a suitable indicator of origin, 149on the standard 150error output 151and invokes 152.Xr exit 2 . 153The 154.Fn regerror 155function 156can be replaced by the user if other actions are desirable. 157.Sh REGULAR EXPRESSION SYNTAX 158A regular expression is zero or more 159.Em branches , 160separated by `|'. 161It matches anything that matches one of the branches. 162.Pp 163A branch is zero or more 164.Em pieces , 165concatenated. 166It matches a match for the first, followed by a match for the second, etc. 167.Pp 168A piece is an 169.Em atom 170possibly followed by `*', `+', or `?'. 171An atom followed by `*' matches a sequence of 0 or more matches of the atom. 172An atom followed by `+' matches a sequence of 1 or more matches of the atom. 173An atom followed by `?' matches a match of the atom, or the null string. 174.Pp 175An atom is a regular expression in parentheses (matching a match for the 176regular expression), a 177.Em range 178(see below), `.' 179(matching any single character), `^' (matching the null string at the 180beginning of the input string), `$' (matching the null string at the 181end of the input string), a `\e' followed by a single character (matching 182that character), or a single character with no other significance 183(matching that character). 184.Pp 185A 186.Em range 187is a sequence of characters enclosed in `[]'. 188It normally matches any single character from the sequence. 189If the sequence begins with `^', 190it matches any single character 191.Em not 192from the rest of the sequence. 193If two characters in the sequence are separated by `\-', this is shorthand 194for the full list of 195.Tn ASCII 196characters between them 197(e.g. `[0-9]' matches any decimal digit). 198To include a literal `]' in the sequence, make it the first character 199(following a possible `^'). 200To include a literal `\-', make it the first or last character. 201.Sh AMBIGUITY 202If a regular expression could match two different parts of the input string, 203it will match the one which begins earliest. 204If both begin in the same place but match different lengths, or match 205the same length in different ways, life gets messier, as follows. 206.Pp 207In general, the possibilities in a list of branches are considered in 208left-to-right order, the possibilities for `*', `+', and `?' are 209considered longest-first, nested constructs are considered from the 210outermost in, and concatenated constructs are considered leftmost-first. 211The match that will be chosen is the one that uses the earliest 212possibility in the first choice that has to be made. 213If there is more than one choice, the next will be made in the same manner 214(earliest possibility) subject to the decision on the first choice. 215And so forth. 216.Pp 217For example, 218.Sq Li (ab|a)b*c 219could match 220`abc' in one of two ways. 221The first choice is between `ab' and `a'; since `ab' is earlier, and does 222lead to a successful overall match, it is chosen. 223Since the `b' is already spoken for, 224the `b*' must match its last possibility\(emthe empty string\(emsince 225it must respect the earlier choice. 226.Pp 227In the particular case where no `|'s are present and there is only one 228`*', `+', or `?', the net effect is that the longest possible 229match will be chosen. 230So 231.Sq Li ab* , 232presented with `xabbbby', will match `abbbb'. 233Note that if 234.Sq Li ab* , 235is tried against `xabyabbbz', it 236will match `ab' just after `x', due to the begins-earliest rule. 237(In effect, the decision on where to start the match is the first choice 238to be made, hence subsequent choices must respect it even if this leads them 239to less-preferred alternatives.) 240.Sh RETURN VALUES 241The 242.Fn regcomp 243function 244returns 245.Dv NULL 246for a failure 247.Pf ( Fn regerror 248permitting), 249where failures are syntax errors, exceeding implementation limits, 250or applying `+' or `*' to a possibly-null operand. 251.Sh SEE ALSO 252.Xr ed 1 , 253.Xr ex 1 , 254.Xr expr 1 , 255.Xr egrep 1 , 256.Xr fgrep 1 , 257.Xr grep 1 , 258.Xr regex 3 259.Sh HISTORY 260Both code and manual page for 261.Fn regcomp , 262.Fn regexec , 263.Fn regsub , 264and 265.Fn regerror 266were written at the University of Toronto 267and appeared in 268.Bx 4.3 tahoe . 269They are intended to be compatible with the Bell V8 270.Xr regexp 3 , 271but are not derived from Bell code. 272.Sh BUGS 273Empty branches and empty regular expressions are not portable to V8. 274.Pp 275The restriction against 276applying `*' or `+' to a possibly-null operand is an artifact of the 277simplistic implementation. 278.Pp 279Does not support 280.Xr egrep Ns 's 281newline-separated branches; 282neither does the V8 283.Xr regexp 3 , 284though. 285.Pp 286Due to emphasis on 287compactness and simplicity, 288it's not strikingly fast. 289It does give special attention to handling simple cases quickly. 290