.\" Copyright 1991 The Regents of the University of California.
.\" All rights reserved.
.\"
.\" %sccs.include.redist.man%
.\"
.\" @(#)regexp.3 5.3 (Berkeley) 08/05/92
.\"
.Dd
.Dt REGEXP 3
.Os
.Sh NAME
.Nm regcomp ,
.Nm regexec ,
.Nm regsub ,
.Nm regerror
.Nd regular expression handlers
.Sh SYNOPSIS
.Fd #include <regexp.h>
.Ft regexp *
.Fn regcomp "const char *exp"
.Ft int
.Fn regexec "const regexp *prog" "const char *string"
.Ft void
.Fn regsub "const regexp *prog" "const char *source" "char *dest"
.Sh DESCRIPTION
This interface is made obsolete by
.Xr regex 3 .
.Pp
The
.Fn regcomp ,
.Fn regexec ,
.Fn regsub ,
and
.Fn regerror
functions
implement
.Xr egrep 1 Ns -style
regular expressions and supporting facilities.
.Pp
The
.Fn regcomp
function
compiles a regular expression into a structure of type
.Vt regexp ,
and returns a pointer to it.
The space has been allocated using
.Xr malloc 3
and may be released by
.Xr free 3 .
.Pp
The
.Fn regexec
function
matches a
.Dv NUL Ns -terminated
.Fa string
against the compiled regular expression
in
.Fa prog .
It returns 1 for success and 0 for failure, and adjusts the contents of
.Fa prog Ns 's
.Em startp
and
.Em endp
(see below) accordingly.
.Pp
The members of a
.Vt regexp
structure include at least the following (not necessarily in order):
.Bd -literal -offset indent
char *startp[NSUBEXP];
char *endp[NSUBEXP];
.Ed
.Pp
where
.Dv NSUBEXP
is defined (as 10) in the header file.
Once a successful
.Fn regexec
has been done using the
.Vt regexp ,
each
.Em startp Ns - Em endp
pair describes one substring
within the
.Fa string ,
with the
.Em startp
pointing to the first character of the substring and
the
.Em endp
pointing to the first character following the substring.
The 0th substring is the substring of
.Fa string
that matched the whole
regular expression.
The others are those substrings that matched parenthesized expressions
within the regular expression, with parenthesized expressions numbered
in left-to-right order of their opening parentheses.
.Pp
The
.Fn regsub
function
copies
.Fa source
to
.Fa dest ,
making substitutions according to the
most recent
.Fn regexec
performed using
.Fa prog .
Each instance of `&' in
.Fa source
is replaced by the substring
indicated by
.Em startp Ns Bq 0
and
.Em endp Ns Bq 0 .
Each instance of
.Sq \e Ns Em n ,
where
.Em n
is a digit, is replaced by
the substring indicated by
.Em startp Ns Bq Em n
and
.Em endp Ns Bq Em n .
To get a literal `&' or
.Sq \e Ns Em n
into
.Fa dest ,
prefix it with `\e';
to get a literal `\e' preceding `&' or
.Sq \e Ns Em n ,
prefix it with
another `\e'.
.Pp
The
.Fn regerror
function
is called whenever an error is detected in
.Fn regcomp ,
.Fn regexec ,
or
.Fn regsub .
The default
.Fn regerror
writes the string
.Fa msg ,
with a suitable indicator of origin,
on the standard
error output
and invokes
.Xr exit 3 .
The
.Fn regerror
function
can be replaced by the user if other actions are desirable.
.Sh REGULAR EXPRESSION SYNTAX
A regular expression is zero or more
.Em branches ,
separated by `|'.
It matches anything that matches one of the branches.
.Pp
A branch is zero or more
.Em pieces ,
concatenated.
It matches a match for the first, followed by a match for the second, etc.
.Pp
A piece is an
.Em atom
possibly followed by `*', `+', or `?'.
An atom followed by `*' matches a sequence of 0 or more matches of the atom.
An atom followed by `+' matches a sequence of 1 or more matches of the atom.
An atom followed by `?'
matches a match of the atom, or the null string.
.Pp
An atom is a regular expression in parentheses (matching a match for the
regular expression), a
.Em range
(see below), `.'
(matching any single character), `^' (matching the null string at the
beginning of the input string), `$' (matching the null string at the
end of the input string), a `\e' followed by a single character (matching
that character), or a single character with no other significance
(matching that character).
.Pp
A
.Em range
is a sequence of characters enclosed in `[]'.
It normally matches any single character from the sequence.
If the sequence begins with `^',
it matches any single character
.Em not
from the rest of the sequence.
If two characters in the sequence are separated by `\-', this is shorthand
for the full list of
.Tn ASCII
characters between them
(e.g. `[0-9]' matches any decimal digit).
To include a literal `]' in the sequence, make it the first character
(following a possible `^').
To include a literal `\-', make it the first or last character.
.Sh AMBIGUITY
If a regular expression could match two different parts of the input string,
it will match the one which begins earliest.
If both begin in the same place but match different lengths, or match
the same length in different ways, life gets messier, as follows.
.Pp
In general, the possibilities in a list of branches are considered in
left-to-right order, the possibilities for `*', `+', and `?' are
considered longest-first, nested constructs are considered from the
outermost in, and concatenated constructs are considered leftmost-first.
The match that will be chosen is the one that uses the earliest
possibility in the first choice that has to be made.
If there is more than one choice, the next will be made in the same manner
(earliest possibility) subject to the decision on the first choice.
And so forth.
.Pp
For example,
.Sq Li (ab|a)b*c
could match
`abc' in one of two ways.
The first choice is between `ab' and `a'; since `ab' is earlier, and does
lead to a successful overall match, it is chosen.
Since the `b' is already spoken for,
the `b*' must match its last possibility\(emthe empty string\(emsince
it must respect the earlier choice.
.Pp
In the particular case where no `|'s are present and there is only one
`*', `+', or `?', the net effect is that the longest possible
match will be chosen.
So
.Sq Li ab* ,
presented with `xabbbby', will match `abbbb'.
Note that if
.Sq Li ab*
is tried against `xabyabbbz', it
will match `ab' just after `x', due to the begins-earliest rule.
(In effect, the decision on where to start the match is the first choice
to be made, hence subsequent choices must respect it even if this leads them
to less-preferred alternatives.)
.Sh RETURN VALUES
The
.Fn regcomp
function
returns
.Dv NULL
for a failure
.Pf ( Fn regerror
permitting),
where failures are syntax errors, exceeding implementation limits,
or applying `+' or `*' to a possibly-null operand.
.Sh SEE ALSO
.Xr ed 1 ,
.Xr ex 1 ,
.Xr expr 1 ,
.Xr egrep 1 ,
.Xr fgrep 1 ,
.Xr grep 1 ,
.Xr regex 3
.Sh HISTORY
Both code and manual page for
.Fn regcomp ,
.Fn regexec ,
.Fn regsub ,
and
.Fn regerror
were written at the University of Toronto
and appeared in
.Bx 4.3 tahoe .
They are intended to be compatible with the Bell V8
.Xr regexp 3 ,
but are not derived from Bell code.
.Sh BUGS
Empty branches and empty regular expressions are not portable to V8.
.Pp
The restriction against
applying `*' or `+' to a possibly-null operand is an artifact of the
simplistic implementation.
.Pp
Does not support
.Xr egrep 1 Ns 's
newline-separated branches;
neither does the V8
.Xr regexp 3 ,
though.
.Pp
Due to emphasis on
compactness and simplicity,
it's not strikingly fast.
It does give special attention to handling simple cases quickly.
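.Sh EXAMPLES
The following is a minimal sketch, not code from this library, of how a
caller might copy out the substring described by one
.Em startp Ns - Em endp
pair once a match has succeeded.
It is self-contained and does not include the header file, so
.Vt struct demo_match ,
.Fn copy_sub ,
and
.Fn demo_fill
are hypothetical names introduced only for illustration.
The pointers are filled in by hand with the result described under
AMBIGUITY for
.Sq Li (ab|a)b*c
matched against `abc', and the sketch assumes that unused pairs are
.Dv NULL .
.Bd -literal -offset indent
```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NSUBEXP 10  /* the value this page gives for the header's constant */

/* Hypothetical stand-in for the startp/endp members of a regexp. */
struct demo_match {
    const char *startp[NSUBEXP];
    const char *endp[NSUBEXP];
};

/*
 * Copy substring n into out and terminate it with a NUL byte.
 * Returns the substring length, or -1 if n is out of range, the pair
 * is unset (assumed here to mean NULL), or out is too small.
 */
int
copy_sub(const struct demo_match *m, int n, char *out, size_t outsz)
{
    size_t len;

    assert(m != NULL && out != NULL);
    if (n < 0 || n >= NSUBEXP || m->startp[n] == NULL || m->endp[n] == NULL)
        return -1;
    /* endp points one past the last character, so the difference is the length. */
    len = (size_t)(m->endp[n] - m->startp[n]);
    if (len >= outsz)
        return -1;
    memcpy(out, m->startp[n], len);
    out[len] = 0;
    return (int)len;
}

/*
 * Fill m by hand with the pairs described under AMBIGUITY for
 * (ab|a)b*c matched against "abc"; string must hold at least 3 characters.
 */
void
demo_fill(struct demo_match *m, const char *string)
{
    int i;

    for (i = 0; i < NSUBEXP; i++) {
        m->startp[i] = NULL;
        m->endp[i] = NULL;
    }
    m->startp[0] = string;      /* whole match: "abc" */
    m->endp[0] = string + 3;
    m->startp[1] = string;      /* first parenthesized expression: "ab" */
    m->endp[1] = string + 2;
}
```
.Ed
.Pp
Calling
.Fn demo_fill
on `abc' and then
.Fn copy_sub
with
.Fa n
set to 0 and 1 yields `abc' and `ab'; any other
.Fa n
returns \-1 because those pairs are unset.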