.\" Copyright (c) 1991, 1993
.\" The Regents of the University of California. All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. All advertising materials mentioning features or use of this software
.\"    must display the following acknowledgement:
.\" This product includes software developed by the University of
.\" California, Berkeley and its contributors.
.\" 4. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" from: @(#)regexp.3 8.1 (Berkeley) 6/4/93
.\" $NetBSD: regexp.3,v 1.13 2002/02/07 09:24:07 ross Exp $
.\"
.Dd June 4, 1993
.Dt REGEXP 3
.Os
.Sh NAME
.Nm regcomp ,
.Nm regexec ,
.Nm regsub ,
.Nm regerror
.Nd regular expression handlers
.Sh LIBRARY
.Lb libcompat
.Sh SYNOPSIS
.Fd #include \*[Lt]regexp.h\*[Gt]
.Ft regexp *
.Fn regcomp "const char *exp"
.Ft int
.Fn regexec "const regexp *prog" "const char *string"
.Ft void
.Fn regsub "const regexp *prog" "const char *source" "char *dest"
.Ft void
.Fn regerror "const char *msg"
.Sh DESCRIPTION
.Bf -symbolic
This interface is made obsolete by
.Xr regex 3 .
It is available from the compatibility library, libcompat.
.Ef
.Pp
The
.Fn regcomp ,
.Fn regexec ,
.Fn regsub ,
and
.Fn regerror
functions implement
.Xr egrep 1 Ns -style
regular expressions and supporting facilities.
.Pp
The
.Fn regcomp
function
compiles a regular expression into a structure of type
.Em regexp ,
and returns a pointer to it.
The space has been allocated using
.Xr malloc 3
and may be released by
.Xr free 3 .
.Pp
The
.Fn regexec
function
matches a
.Dv NUL Ns -terminated
.Fa string
against the compiled regular expression
in
.Fa prog .
It returns 1 for success and 0 for failure, and adjusts the contents of
.Fa prog Ns 's
.Em startp
and
.Em endp
(see below) accordingly.
.Pp
The members of a
.Em regexp
structure include at least the following (not necessarily in order):
.Bd -literal -offset indent
char *startp[NSUBEXP];
char *endp[NSUBEXP];
.Ed
.Pp
where
.Dv NSUBEXP
is defined (as 10) in the header file.
Once a successful
.Fn regexec
has been done using the
.Em regexp ,
each
.Em startp Ns - Em endp
pair describes one substring
within the
.Fa string ,
with the
.Em startp
pointing to the first character of the substring and
the
.Em endp
pointing to the first character following the substring.
The 0th substring is the substring of
.Fa string
that matched the whole
regular expression.
The others are those substrings that matched parenthesized expressions
within the regular expression, with parenthesized expressions numbered
in left-to-right order of their opening parentheses.
.Pp
The
.Fn regsub
function
copies
.Fa source
to
.Fa dest ,
making substitutions according to the
most recent
.Fn regexec
performed using
.Fa prog .
Each instance of `\*[Am]' in
.Fa source
is replaced by the substring
indicated by
.Em startp Ns Bq 0
and
.Em endp Ns Bq 0 .
Each instance of
.Sq \e Ns Em n ,
where
.Em n
is a digit, is replaced by
the substring indicated by
.Em startp Ns Bq Em n
and
.Em endp Ns Bq Em n .
To get a literal `\*[Am]' or
.Sq \e Ns Em n
into
.Fa dest ,
prefix it with `\e';
to get a literal `\e' preceding `\*[Am]' or
.Sq \e Ns Em n ,
prefix it with
another `\e'.
.Pp
The
.Fn regerror
function
is called whenever an error is detected in
.Fn regcomp ,
.Fn regexec ,
or
.Fn regsub .
The default
.Fn regerror
writes the string
.Fa msg ,
with a suitable indicator of origin,
on the standard
error output
and invokes
.Xr exit 3 .
The
.Fn regerror
function
can be replaced by the user if other actions are desirable.
.Sh REGULAR EXPRESSION SYNTAX
A regular expression is zero or more
.Em branches ,
separated by `|'.
It matches anything that matches one of the branches.
.Pp
A branch is zero or more
.Em pieces ,
concatenated.
It matches a match for the first, followed by a match for the second, etc.
.Pp
A piece is an
.Em atom
possibly followed by `*', `+', or `?'.
An atom followed by `*' matches a sequence of 0 or more matches of the atom.
An atom followed by `+' matches a sequence of 1 or more matches of the atom.
An atom followed by `?' matches a match of the atom, or the null string.
.Pp
An atom is a regular expression in parentheses (matching a match for the
regular expression), a
.Em range
(see below), `.'
(matching any single character), `^' (matching the null string at the
beginning of the input string), `$' (matching the null string at the
end of the input string), a `\e' followed by a single character (matching
that character), or a single character with no other significance
(matching that character).
.Pp
A
.Em range
is a sequence of characters enclosed in `[]'.
It normally matches any single character from the sequence.
If the sequence begins with `^',
it matches any single character
.Em not
from the rest of the sequence.
If two characters in the sequence are separated by `\-', this is shorthand
for the full list of
.Tn ASCII
characters between them
(e.g. `[0-9]' matches any decimal digit).
To include a literal `]' in the sequence, make it the first character
(following a possible `^').
To include a literal `\-', make it the first or last character.
.Sh AMBIGUITY
If a regular expression could match two different parts of the input string,
it will match the one which begins earliest.
If both begin in the same place but match different lengths, or match
the same length in different ways, life gets messier, as follows.
.Pp
In general, the possibilities in a list of branches are considered in
left-to-right order, the possibilities for `*', `+', and `?'
are
considered longest-first, nested constructs are considered from the
outermost in, and concatenated constructs are considered leftmost-first.
The match that will be chosen is the one that uses the earliest
possibility in the first choice that has to be made.
If there is more than one choice, the next will be made in the same manner
(earliest possibility) subject to the decision on the first choice.
And so forth.
.Pp
For example,
.Sq Li (ab|a)b*c
could match
`abc' in one of two ways.
The first choice is between `ab' and `a'; since `ab' is earlier, and does
lead to a successful overall match, it is chosen.
Since the `b' is already spoken for,
the `b*' must match its last possibility\(emthe empty string\(emsince
it must respect the earlier choice.
.Pp
In the particular case where no `|'s are present and there is only one
`*', `+', or `?', the net effect is that the longest possible
match will be chosen.
So
.Sq Li ab* ,
presented with `xabbbby', will match `abbbb'.
Note that if
.Sq Li ab*
is tried against `xabyabbbz', it
will match `ab' just after `x', due to the begins-earliest rule.
(In effect, the decision on where to start the match is the first choice
to be made, hence subsequent choices must respect it even if this leads them
to less-preferred alternatives.)
.Sh RETURN VALUES
The
.Fn regcomp
function
returns
.Dv NULL
for a failure
.Pf ( Fn regerror
permitting),
where failures are syntax errors, exceeding implementation limits,
or applying `+' or `*' to a possibly-null operand.
.Sh SEE ALSO
.Xr ed 1 ,
.Xr egrep 1 ,
.Xr ex 1 ,
.Xr expr 1 ,
.Xr fgrep 1 ,
.Xr grep 1 ,
.Xr regex 3
.Sh HISTORY
Both code and manual page for
.Fn regcomp ,
.Fn regexec ,
.Fn regsub ,
and
.Fn regerror
were written at the University of Toronto
and appeared in
.Bx 4.3 tahoe .
They are intended to be compatible with the Bell V8
.Xr regexp 3 ,
but are not derived from Bell code.
.Sh BUGS
Empty branches and empty regular expressions are not portable to V8.
.Pp
The restriction against
applying `*' or `+' to a possibly-null operand is an artifact of the
simplistic implementation.
.Pp
Does not support
.Xr egrep 1 Ns 's
newline-separated branches;
neither does the V8
.Xr regexp 3 ,
though.
.Pp
Due to emphasis on
compactness and simplicity,
it's not strikingly fast.
It does give special attention to handling simple cases quickly.
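.Sh EXAMPLES
The substitution rules described for
.Fn regsub
can be illustrated without linking against libcompat.
The following is a minimal sketch, not the library's implementation:
the
.Li submatches
structure and the
.Li expand_template
function are hypothetical names introduced here, and the size-limited
destination buffer is an added assumption (the documented
.Fn regsub
takes no length argument).
Only the
.Em startp
and
.Em endp
members and the value of
.Dv NSUBEXP
are taken from this page.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NSUBEXP 10	/* value this page gives for the header's NSUBEXP */

/* Hypothetical stand-in for the two documented members of a regexp. */
struct submatches {
	const char *startp[NSUBEXP];
	const char *endp[NSUBEXP];
};

/*
 * Expand source into dest using the rules described for regsub:
 *   &    -> substring 0 (the whole match)
 *   \n   -> substring n, for a digit n
 *   \&   -> a literal '&'
 *   \\   -> a literal '\'
 * Returns 0 on success, or -1 if dest (of `size` bytes) is too small.
 * The size check is an assumption of this sketch, not part of regsub.
 */
int
expand_template(const struct submatches *m, const char *source,
    char *dest, size_t size)
{
	size_t used = 0;

	assert(m != NULL && source != NULL && dest != NULL && size > 0);
	while (*source != '\0') {
		char c = *source++;
		int no = -1;	/* substring number, or -1 for plain text */

		if (c == '&')
			no = 0;
		else if (c == '\\' && *source >= '0' && *source <= '9')
			no = *source++ - '0';

		if (no < 0) {
			/* Ordinary character; drop an escaping backslash. */
			if (c == '\\' && (*source == '\\' || *source == '&'))
				c = *source++;
			if (used + 1 >= size)
				return -1;
			dest[used++] = c;
		} else if (m->startp[no] != NULL && m->endp[no] != NULL) {
			/* Copy the bytes between startp[no] and endp[no]. */
			size_t len = (size_t)(m->endp[no] - m->startp[no]);

			if (used + len >= size)
				return -1;
			memcpy(dest + used, m->startp[no], len);
			used += len;
		}
	}
	dest[used] = '\0';
	return 0;
}
```

.Pp
In this sketch a substring whose
.Em startp
or
.Em endp
is
.Dv NULL
expands to nothing.
To try it, fill in
.Em startp Ns Bq 0
and
.Em endp Ns Bq 0
by hand so that they delimit `abbbb' inside `xabbbby',
then expand a template such as `<\*[Am]>'.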