REGEXP(3)                 386BSD Programmer's Manual                 REGEXP(3)

NAME
     regcomp, regexec, regsub, regerror - regular expression handlers

SYNOPSIS
     #include <regexp.h>

     regexp *
     regcomp(const char *exp)

     int
     regexec(const regexp *prog, const char *string)

     void
     regsub(const regexp *prog, const char *source, char *dest)

DESCRIPTION
     The regcomp(), regexec(), regsub(), and regerror() functions implement
     egrep(1)-style regular expressions and supporting facilities.

     The regcomp() function compiles a regular expression into a structure of
     type regexp, and returns a pointer to it.  The space has been allocated
     using malloc(3) and may be released by free(3).

     The regexec() function matches a NUL-terminated string against the
     compiled regular expression in prog.  It returns 1 for success and 0 for
     failure, and adjusts the contents of prog's startp and endp (see below)
     accordingly.

     The members of a regexp structure include at least the following (not
     necessarily in order):

           char *startp[NSUBEXP];
           char *endp[NSUBEXP];

     where NSUBEXP is defined (as 10) in the header file.  Once a successful
     regexec() has been done using the regexp, each startp-endp pair
     describes one substring within the string, with the startp pointing to
     the first character of the substring and the endp pointing to the first
     character following the substring.  The 0th substring is the substring of
     string that matched the whole regular expression.  The others are those
     substrings that matched parenthesized expressions within the regular
     expression, with parenthesized expressions numbered in left-to-right
     order of their opening parentheses.
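
     The half-open startp/endp convention described above can be sketched
     as follows.  This is only an illustration: struct demo_match and
     copy_submatch() are hypothetical names, not part of <regexp.h>, and
     the sketch assumes that an unused pair is left as NULL, which this
     page does not promise.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NSUBEXP 10

/* Hypothetical stand-in for the startp/endp members of a regexp. */
struct demo_match {
    const char *startp[NSUBEXP];
    const char *endp[NSUBEXP];
};

/*
 * Copy substring n into buf (capacity bufsize, including the NUL).
 * Returns the substring length, or -1 if the pair is unset (assumed
 * to mean NULL here) or the substring does not fit.
 */
long
copy_submatch(const struct demo_match *m, int n, char *buf, size_t bufsize)
{
    size_t len;

    assert(m != NULL && buf != NULL);
    assert(n >= 0 && n < NSUBEXP);
    if (m->startp[n] == NULL || m->endp[n] == NULL)
        return -1;
    /* endp points one past the last character, so the difference is the length */
    len = (size_t)(m->endp[n] - m->startp[n]);
    if (len + 1 > bufsize)
        return -1;
    memcpy(buf, m->startp[n], len);
    buf[len] = '\0';
    return (long)len;
}
```

     With pairs set up as a match of `([a-z]+)=([a-z]+)' against
     `key=value' would leave them, substring 0 is the whole of `key=value',
     substring 1 is `key', and substring 2 is `value'.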

     The regsub() function copies source to dest, making substitutions
     according to the most recent regexec() performed using prog.  Each
     instance of `&' in source is replaced by the substring indicated by
     startp[0] and endp[0].  Each instance of `\n', where n is a digit, is
     replaced by the substring indicated by startp[n] and endp[n].  To get a
     literal `&' or `\n' into dest, prefix it with `\'; to get a literal `\'
     preceding `&' or `\n', prefix it with another `\'.
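
     The substitution rules above can be sketched as a small stand-alone
     routine.  This is an illustration of the documented behavior, not the
     library's source: struct demo_match, append_submatch(), and
     demo_regsub() are hypothetical names, unset pairs are assumed to be
     NULL and to contribute nothing, and the caller is assumed to supply a
     dest that is large enough.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NSUBEXP 10

/* Hypothetical stand-in for the startp/endp members of a regexp. */
struct demo_match {
    const char *startp[NSUBEXP];
    const char *endp[NSUBEXP];
};

/* Append substring n to dst and return the new end of dst. */
static char *
append_submatch(const struct demo_match *m, int n, char *dst)
{
    if (m->startp[n] != NULL && m->endp[n] != NULL) {
        size_t len = (size_t)(m->endp[n] - m->startp[n]);

        memcpy(dst, m->startp[n], len);
        dst += len;
    }
    return dst;
}

/* Copy source to dest, expanding `&' and `\n' as described in the text. */
void
demo_regsub(const struct demo_match *m, const char *source, char *dest)
{
    const char *src = source;
    char *dst = dest;
    char c;

    assert(m != NULL && source != NULL && dest != NULL);
    while ((c = *src++) != '\0') {
        if (c == '&') {
            /* `&' stands for the whole match, pair 0 */
            dst = append_submatch(m, 0, dst);
        } else if (c == '\\' && *src >= '0' && *src <= '9') {
            /* backslash followed by a digit selects that pair */
            dst = append_submatch(m, *src++ - '0', dst);
        } else {
            /* backslash before `\' or `&' yields that character literally */
            if (c == '\\' && (*src == '\\' || *src == '&'))
                c = *src++;
            *dst++ = c;
        }
    }
    *dst = '\0';
}
```

     With pair 1 covering `hello' and pair 2 covering `world', the source
     `\2 \1' produces `world hello', while `\&' produces a literal `&' and
     `\\1' produces the two characters `\1'.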

     The regerror() function is called whenever an error is detected in
     regcomp(), regexec(), or regsub().  The default regerror() writes the
     string msg, with a suitable indicator of origin, on the standard error
     output and invokes exit(2).  The regerror() function can be replaced by
     the user if other actions are desirable.
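
     A replacement handler might record the message instead of exiting, so
     that the caller can test the return value of regcomp() and report the
     problem itself.  The sketch below is hedged accordingly: the SYNOPSIS
     does not give a prototype for regerror(), so the single const char *
     argument is an assumption, and last_regerror_message() is a
     hypothetical helper.

```c
#include <assert.h>
#include <string.h>

/* Most recent message handed to the replacement handler. */
static char last_regerror[128];

/*
 * Replacement handler: remember the message and return, rather than
 * printing it and exiting.  The prototype is assumed, not documented.
 */
void
regerror(const char *msg)
{
    assert(msg != NULL);
    strncpy(last_regerror, msg, sizeof last_regerror - 1);
    last_regerror[sizeof last_regerror - 1] = '\0';
}

/* Return the last recorded message, or "" if none has been recorded. */
const char *
last_regerror_message(void)
{
    return last_regerror;
}
```

     Because this handler returns, a failed regcomp() would be expected to
     hand back NULL to its caller, who could then consult the saved
     message.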

REGULAR EXPRESSION SYNTAX
     A regular expression is zero or more branches, separated by `|'.  It
     matches anything that matches one of the branches.

     A branch is zero or more pieces, concatenated.  It matches a match for
     the first, followed by a match for the second, etc.

     A piece is an atom possibly followed by `*', `+', or `?'.  An atom
     followed by `*' matches a sequence of 0 or more matches of the atom.  An
     atom followed by `+' matches a sequence of 1 or more matches of the atom.
     An atom followed by `?' matches a match of the atom, or the null string.

     An atom is a regular expression in parentheses (matching a match for the
     regular expression), a range (see below), `.' (matching any single
     character), `^' (matching the null string at the beginning of the input
     string), `$' (matching the null string at the end of the input string), a
     `\' followed by a single character (matching that character), or a single
     character with no other significance (matching that character).

     A range is a sequence of characters enclosed in `[]'.  It normally
     matches any single character from the sequence.  If the sequence begins
     with `^', it matches any single character not from the rest of the
     sequence.  If two characters in the sequence are separated by `-', this
     is shorthand for the full list of ASCII characters between them (e.g.
     `[0-9]' matches any decimal digit).  To include a literal `]' in the
     sequence, make it the first character (following a possible `^').  To
     include a literal `-', make it the first or last character.
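
     The range rules can be illustrated with a small membership test.  This
     is a sketch only: range_matches() is a hypothetical name, it takes the
     characters between the brackets rather than parsing the brackets
     themselves (so the rule about a leading `]' is not exercised), and a
     reversed pair such as `z-a' is simply assumed to match nothing.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Does character c match a range whose bracket contents are seq?
 * seq excludes the enclosing brackets, e.g. "^0-9" for `[^0-9]'.
 * Returns 1 on a match and 0 otherwise.
 */
int
range_matches(const char *seq, char c)
{
    size_t i = 0, n;
    int negate = 0, found = 0;
    unsigned char uc = (unsigned char)c;

    assert(seq != NULL);
    n = strlen(seq);
    if (n > 0 && seq[0] == '^') {
        /* a leading `^' inverts the sense of the rest */
        negate = 1;
        i = 1;
    }
    while (i < n) {
        unsigned char lo = (unsigned char)seq[i];

        if (i + 2 < n && seq[i + 1] == '-') {
            /* lo-hi: every character code from lo through hi */
            unsigned char hi = (unsigned char)seq[i + 2];

            if (lo <= uc && uc <= hi)
                found = 1;
            i += 3;
        } else {
            /* a single character, including a first or last `-' */
            if (uc == lo)
                found = 1;
            i += 1;
        }
    }
    return negate ? !found : found;
}
```

     Under these assumptions `0-9' accepts `5' and rejects `a', `^0-9'
     does the opposite, and both `a-' and `-a' accept a literal `-'.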

AMBIGUITY
     If a regular expression could match two different parts of the input
     string, it will match the one which begins earliest.  If both begin in
     the same place but match different lengths, or match the same length in
     different ways, life gets messier, as follows.

     In general, the possibilities in a list of branches are considered in
     left-to-right order, the possibilities for `*', `+', and `?' are
     considered longest-first, nested constructs are considered from the
     outermost in, and concatenated constructs are considered leftmost-first.
     The match that will be chosen is the one that uses the earliest
     possibility in the first choice that has to be made.  If there is more
     than one choice, the next will be made in the same manner (earliest
     possibility) subject to the decision on the first choice.  And so forth.

     For example, `(ab|a)b*c' could match `abc' in one of two ways.  The first
     choice is between `ab' and `a'; since `ab' is earlier, and does lead to a
     successful overall match, it is chosen.  Since the `b' is already spoken
     for, the `b*' must match its last possibility (the empty string), since
     it must respect the earlier choice.
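
     The order of choices in this example can be traced with a hand-written
     matcher for this one expression.  It is a sketch of the documented
     ordering only, not of how the library is implemented; match_example()
     is a hypothetical name, and the attempt is anchored at the start of
     the string for simplicity.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Try `(ab|a)b*c' at the start of s.  On success return 1 and store
 * the branch taken (0 for `ab', 1 for `a') and the number of
 * characters consumed by `b*'.  Return 0 if nothing matches.
 */
int
match_example(const char *s, size_t *branch, size_t *star_len)
{
    static const char *const branches[] = { "ab", "a" };
    size_t b;

    assert(s != NULL && branch != NULL && star_len != NULL);
    for (b = 0; b < 2; b++) {
        /* first choice: branches in left-to-right order */
        size_t blen = strlen(branches[b]);
        const char *rest;
        size_t max = 0, k;

        if (strncmp(s, branches[b], blen) != 0)
            continue;
        rest = s + blen;
        while (rest[max] == 'b')
            max++;
        /* second choice: `b*' longest first, backing off one at a time */
        for (k = max + 1; k-- > 0; ) {
            if (rest[k] == 'c') {
                *branch = b;
                *star_len = k;
                return 1;
            }
        }
    }
    return 0;
}
```

     For `abc' this reports branch 0 with `b*' matching zero characters,
     as the text describes.  For `ac' the first branch fails outright, so
     branch 1 is used.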

     In the particular case where no `|'s are present and there is only one
     `*', `+', or `?', the net effect is that the longest possible match will
     be chosen.  So `ab*', presented with `xabbbby', will match `abbbb'.  Note
     that if `ab*' is tried against `xabyabbbz', it will match `ab' just
     after `x', due to the begins-earliest rule.  (In effect, the decision on
     where to start the match is the first choice to be made, hence subsequent
     choices must respect it even if this leads them to less-preferred
     alternatives.)
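
     The interaction of the begins-earliest rule with longest-first
     repetition can be seen in a hand-written search for `ab*'.  As
     before, this is a sketch of the documented behavior with a
     hypothetical function name, not the library's algorithm.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Find the match of `ab*' in s.  Return 1 and set *start and *end
 * (end points just past the match), or return 0 if s contains no `a'.
 */
int
match_ab_star(const char *s, const char **start, const char **end)
{
    const char *p;

    assert(s != NULL && start != NULL && end != NULL);
    for (p = s; *p != '\0'; p++) {
        /* the earliest starting point that works is taken */
        if (*p == 'a') {
            const char *q = p + 1;

            /* from there, `b*' takes as many characters as it can */
            while (*q == 'b')
                q++;
            *start = p;
            *end = q;
            return 1;
        }
    }
    return 0;
}
```

     Against `xabbbby' the match is the five characters `abbbb'; against
     `xabyabbbz' it is the two characters `ab' after the `x', even though
     a longer match begins later in the string.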

RETURN VALUES
     The regcomp() function returns NULL for a failure (regerror()
     permitting), where failures are syntax errors, exceeding implementation
     limits, or applying `+' or `*' to a possibly-null operand.

SEE ALSO
     ed(1), ex(1), expr(1), egrep(1), fgrep(1), grep(1), regex(3)

HISTORY
     Both code and manual page for regcomp(), regexec(), regsub(), and
     regerror() were written at the University of Toronto and appeared in
     4.3BSD-Tahoe.  They are intended to be compatible with the Bell V8
     regexp(3), but are not derived from Bell code.

BUGS
     Empty branches and empty regular expressions are not portable to V8.

     The restriction against applying `*' or `+' to a possibly-null operand is
     an artifact of the simplistic implementation.

     Does not support egrep's newline-separated branches; neither does the V8
     regexp(3), though.

     Due to emphasis on compactness and simplicity, it's not strikingly fast.
     It does give special attention to handling simple cases quickly.

BSD Experimental                April 19, 1991